Abstract: Regular tree grammars and regular path expressions constitute core
constructs widely used in programming languages and type systems. Nevertheless,
there has been little research so far on frameworks for reasoning about path
expressions where node cardinality constraints occur along a path in a tree. We
present a logic capable of expressing deep counting along paths which may
include arbitrary recursive forward and backward navigation. The counting
extensions can be seen as a generalization of graded modalities that count
immediate successor nodes. While the combination of graded modalities,
nominals, and inverse modalities yields undecidable logics over graphs, we show
that these features can be combined in a decidable tree logic whose main
features can be decided in exponential time. Our logic being closed under
negation, it may be used to decide typical problems on XPath queries such as
satisfiability, type checking with relation to regular types, containment, or
equivalence.
On the Count of Trees



Summary: This document introduces a tree logic that is decidable in exponential
time and capable of expressing cardinality constraints along multidirectional
paths.
1 Introduction

A fundamental peculiarity of XML is the description of regular properties. For 
example, in XML schema languages the content types of element definitions 
rely on regular expressions. In addition, selecting nodes in such constrained 
trees is also done by means of regular path expressions (a la XPath). In both 
cases, it is often interesting to be able to express conditions on the frequency of 
occurrences of nodes. 

Even if we consider simple strings, it is well known that some formal languages
easily described in English may require voluminous regular expressions. For
instance, as pointed out in [13], the language $L_{2a2b}$ of all strings over
$\Sigma = \{a, b, c\}$ containing at least two occurrences of $a$ and at least
two occurrences of $b$ requires a large expression, such as:

    $\Sigma^* a \Sigma^* a \Sigma^* b \Sigma^* b \Sigma^* \;\cup\; \Sigma^* a \Sigma^* b \Sigma^* a \Sigma^* b \Sigma^*$

    $\cup\; \Sigma^* a \Sigma^* b \Sigma^* b \Sigma^* a \Sigma^* \;\cup\; \Sigma^* b \Sigma^* b \Sigma^* a \Sigma^* a \Sigma^*$

    $\cup\; \Sigma^* b \Sigma^* a \Sigma^* b \Sigma^* a \Sigma^* \;\cup\; \Sigma^* b \Sigma^* a \Sigma^* a \Sigma^* b \Sigma^*$.

If we add $\cap$ to the operators for forming regular expressions, then the
language $L_{2a2b}$ can be expressed more concisely as
$(\Sigma^* a \Sigma^* a \Sigma^*) \cap (\Sigma^* b \Sigma^* b \Sigma^*)$. In
logical terms, conjunction offers a dramatic reduction in expression size, which
is crucial when the complexity of the decision procedure depends on formula size.

If we now consider a formalism equipped with the ability to describe numerical
constraints on the frequency of occurrences, we get a second (exponential)
reduction in size. For instance, the above expression can be formulated as
$(\Sigma^* a \Sigma^*)^2 \cap (\Sigma^* b \Sigma^*)^2$. We can even write
$(\Sigma^* a \Sigma^*)^{2^n} \cap (\Sigma^* b \Sigma^*)^{2^n}$ (for any
natural $n$) instead of a (much) larger expression.
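
As a rough illustration of the three notations (a sketch only, not part of the
original development), the following Python snippet checks on all short strings
that the six-way union, the conjunctive form, and the counting form describe the
same language. The helper names and the length bound are illustrative
assumptions.

```python
import itertools
import re

SIGMA = "abc"

# The six-way union: one branch per interleaving of two a's and two b's.
ORDERS = ["aabb", "abab", "abba", "bbaa", "baba", "baab"]
UNION = re.compile("|".join(
    "[abc]*" + "[abc]*".join(order) + "[abc]*" for order in ORDERS
))

def in_union(word: str) -> bool:
    """Membership using only concatenation, union, and Kleene star."""
    return UNION.fullmatch(word) is not None

def in_intersection(word: str) -> bool:
    """Membership using the intersection of two small expressions."""
    two_a = re.fullmatch("[abc]*a[abc]*a[abc]*", word) is not None
    two_b = re.fullmatch("[abc]*b[abc]*b[abc]*", word) is not None
    return two_a and two_b

def in_counting(word: str, k: int = 2) -> bool:
    """Membership using explicit occurrence counts."""
    return word.count("a") >= k and word.count("b") >= k

def agree_up_to(max_len: int) -> bool:
    """Compare the three definitions on every string of length <= max_len."""
    for n in range(max_len + 1):
        for letters in itertools.product(SIGMA, repeat=n):
            w = "".join(letters)
            if not (in_union(w) == in_intersection(w) == in_counting(w)):
                return False
    return True
```

The union pattern grows with the number of interleavings, whereas the counting
version only changes the value of `k`.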

Different extensions of regular expressions with intersection, counting con- 
straints, and interleaving have been considered over strings, and for describing 
content models of sibling nodes in XML type languages [4, 9, 15]. The complexity
of the inclusion problem over these different language extensions and their
combinations typically ranges from polynomial time to exponential space (see 
[9] for a survey). The main distinction between these works and the work pre- 
sented here is that we focus on counting nodes located along deep and recursive 
paths in trees. 

When considering regular tree languages instead of regular string languages,
succinct syntax such as the one presented above is even more useful, as branching
results in a higher combinatorial complexity. In the case of trees, it is often
useful to express cardinality constraints not only on the sequence of children
nodes, but also in a particular region of a tree, such as a subtree. Suppose, for
instance, that we want to define a tree language over $\Sigma$ where there are no
more than 2 "b" nodes. This requires a quite large regular tree type expression such






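as:

    $x_{root} \to b[x_{b \le 1}] \mid c[x_{b \le 2}] \mid a[x_{b \le 2}]$

    $x_{b \le 2} \to x_{\neg b}, b[x_{\neg b}], x_{\neg b}, b[x_{\neg b}], x_{\neg b} \mid x_{\neg b}, b[x_{b \le 1}], x_{\neg b}$
        $\mid x_{\neg b}, a[x_{b \le 2}] \mid x_{\neg b}, c[x_{b \le 2}] \mid x_{b \le 1}$

    $x_{b \le 1} \to x_{\neg b} \mid x_{\neg b}, b[x_{\neg b}], x_{\neg b} \mid a[x_{b \le 1}] \mid c[x_{b \le 1}]$

    $x_{\neg b} \to (a[x_{\neg b}] \mid c[x_{\neg b}])^*$

where $x_{root}$ is the starting non-terminal; $x_{\neg b}$, $x_{b \le 1}$,
$x_{b \le 2}$ are non-terminals; the notation $a[x_{\neg b}]$ describes a subtree
whose root is labeled $a$ and in which there is no $b$ node; and "," is
concatenation.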

More generally, the widely adopted notations for regular tree grammars pro- 
duce very verbose definitions for properties involving cardinality constraints on 
the nesting of elements.¹

The problem with regular tree (and even string) grammars is that one is 
forced to fully expand all the patterns of interest using concatenation, union, 
and Kleene star. Instead, it is often tempting to rely on another kind of (for- 
mal) notation that just describes a simple pattern and additional constraints 
on it, which are intuitive and compact with respect to size. For instance, one 
could imagine denoting the previous example as follows, where the additional 
constraint is described using XPath notation: 

    $(x \to (a[x] \mid b[x] \mid c[x])^*) \wedge \mathrm{count}(\text{/descendant-or-self::}b) \le 2$
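
To make the contrast concrete, here is a minimal Python sketch of the "simple
pattern plus counting constraint" reading: any tree over {a, b, c} is accepted
as long as the number of "b" nodes, counted over the root and all its
descendants, is at most 2. The `Node` class and function names are hypothetical
and only serve the illustration; this checks one given tree and is not a
decision procedure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def count_label(tree: Node, label: str) -> int:
    """Number of nodes labeled `label` among the root and its descendants."""
    own = 1 if tree.label == label else 0
    return own + sum(count_label(child, label) for child in tree.children)

def at_most_two_b(tree: Node) -> bool:
    """Mirror of the constraint count(/descendant-or-self::b) <= 2."""
    return count_label(tree, "b") <= 2
```

The check stays the same size whatever the bound is, whereas the grammar above
needs one family of non-terminals per admissible count.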

Although this kind of counting operator does not increase the expressive
power of regular tree grammars, it can have a drastic impact on succinctness,
thus making reasoning over these languages harder (as noticed in [7] in the case
of strings). Indeed, reasoning on these kinds of extensions without relying on
their expansion (in order to avoid syntactic blow-ups) is often tricky [8].
Determining satisfiability, containment, and equivalence over these classes of
extended regular expressions typically requires involved algorithms with higher
complexity [22] compared to ordinary regular expressions.

In the present paper, we propose a succinct logical notation, equipped with 
a satisfiability checking algorithm, for describing many sorts of cardinality con- 
straints on the frequency of occurrence of nodes in regular tree types. Regular 
tree types encompass most of XML types (DTDs, XML Schemas, RelaxNGs) 
used in practice today. 

XPath is the standard query language for XML documents, and it is an 
important part of other XML technologies such as XSLT and XQuery. XPath 
expressions are regular path expressions interpreted as sets of nodes selected 
from a given context node. One of the reasons why XPath is popular for web 
programming resides in its ability to express multidirectional navigation. In- 
deed, XPath expressions may use recursive navigation, to access descendant 
nodes, and also backward navigation, to reach previous siblings or ancestor
nodes. Expressing cardinality restrictions on nodes accessible by recursive
multidirectional paths may introduce an extra-exponential cost [27], or may
even lead to undecidable formalisms [27, 6]. We present in this paper a
decidable framework capable of succinctly expressing cardinality constraints
along deep multidirectional paths.

1 This is typically the reason why the standard DTD for XHTML does not syntactically
prevent the nesting of anchors, whereas this nesting is actually prohibited in the XHTML
standard.

A major application of this logical framework is the decision of problems 
found in the static analysis of programming languages manipulating XML data. 
For instance, since the logic is closed under negation, it can be used to solve sub- 
typing problems such as XPath containment in the presence of tree constraints. 
Checking that a query $q$ is contained in a query $p$ with this logical approach
amounts to verifying the validity of $q \Rightarrow p$, or equivalently, the
unsatisfiability of $q \wedge \neg p$.

Contributions We extend a tree logic with a succinct notation for counting 
operators. These operators allow arbitrarily deep and recursive counting con- 
straints. We present a sound and complete algorithm for checking satisfiability 
of logical formulas. We show that its complexity is exponential in the size of 
the succinct form. 

Outline We introduce the logic in Section 2. Section 3 shows how the logic can
be applied in the XML setting, in particular for the static analysis of XPath
expressions and of common schemas containing constraints on the frequency
of occurrence of nodes. The decision procedure and the proofs of soundness,
completeness, and complexity are presented in Section 4. Finally, we review
related work in Section 5 before concluding in Section 6.

2 Counting Tree Logic 

We introduce our syntax for trees, define a notion of trails in trees, then present 
the syntax and semantics of logical formulas. 

2.1 Trees 

We consider finite trees which are node-labeled and sibling-ordered. Since there
is a well-known bijective encoding between n-ary and binary trees, we focus
on binary trees without loss of generality. Specifically, we use the encoding
represented in Figure 1, where the binary representation preserves the first child
of a node and appends sibling nodes as second successors.
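
The encoding can be sketched in a few lines of Python. This is an illustrative
first-child/next-sibling encoding under assumed class names (`NaryNode`,
`BinNode`), not code from the original report; `decode` is included to show that
the mapping can be inverted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NaryNode:
    label: str
    children: List["NaryNode"] = field(default_factory=list)

@dataclass
class BinNode:
    label: str
    first_child: Optional["BinNode"] = None   # first successor
    next_sibling: Optional["BinNode"] = None  # second successor

def _encode_forest(forest: List[NaryNode]) -> Optional[BinNode]:
    """Encode an ordered list of siblings; the result is its first element."""
    head = None
    for node in reversed(forest):
        head = BinNode(node.label, _encode_forest(node.children), head)
    return head

def encode(tree: NaryNode) -> BinNode:
    """Keep the first child as first successor, chain siblings as second."""
    return _encode_forest([tree])

def decode(binary: Optional[BinNode]) -> List[NaryNode]:
    """Inverse direction: rebuild the ordered list of siblings."""
    forest = []
    while binary is not None:
        forest.append(NaryNode(binary.label, decode(binary.first_child)))
        binary = binary.next_sibling
    return forest
```

For a single tree `t`, `decode(encode(t))` returns the one-element list `[t]`.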

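The structure of a tree is built upon modalities "$\triangledown$" and
"$\triangleright$". Modality "$\triangledown$" labels the edge between a node and
its first child. Modality "$\triangleright$" labels the edge between a node and
its next sibling. Converse modalities "$\triangle$" and "$\triangleleft$"
respectively label the same edges in the reverse direction.

We define a Kripke semantics for our tree logic, similar to the one of modal
logics [29]. We write $M = \{\triangledown, \triangleright, \triangle, \triangleleft\}$
for the set of modalities. For $m \in M$ we denote by $\overline{m}$ the
corresponding inverse modality ($\overline{\triangledown} = \triangle$,
$\overline{\triangleright} = \triangleleft$, $\overline{\triangle} = \triangledown$,
$\overline{\triangleleft} = \triangleright$).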



RR n° 7251







Figure 1: n-ary to binary trees 

We also consider a countable alphabet P of propositions representing names of 
nodes. A node is always labeled with exactly one proposition. 

A tree is defined as a tuple $(N, R, L)$, where $N$ is a finite set of nodes; $R$
is a partial mapping from $N \times M$ to $N$ that defines a tree structure;² and
$L$ is a labeling function from $N$ to $P$.

2.2 Trails 

Trails are defined as regular expressions formed by modalities, as follows:

    $\alpha ::= \alpha_0 \;\mid\; \alpha_0^* \;\mid\; \alpha_0^*, \alpha$
    $\alpha_0 ::= m \;\mid\; \alpha_0, \alpha_0 \;\mid\; \alpha_0 | \alpha_0$

We restrict trails to sequences of repeated subtrails (which themselves contain no
repetition) followed by a subtrail (with no repetition). Since we do not consider
infinite paths, we also disallow trails where both a subtrail and its converse
occur under the scope of the recursion operator, thus ensuring cycle-freeness
(see Section 2.5). These restrictions on trails allow us to prove the completeness
of our approach while retaining the ability to express many counting formulas,
such as the ones of XPath.

Trails are interpreted as sets of paths. A path, written $p$, is a sequence of
modalities that belongs to the regular language denoted by the trail, written
$p \in \alpha$.

In a given tree, we say that there is a trail $\alpha$ from the node $n_0$ to the
node $n_k$, written $n_0 \xrightarrow{\alpha} n_k$, if and only if there is a
sequence of nodes $n_0, \ldots, n_k$ and a path $p = m_1, \ldots, m_k$ such that
$p \in \alpha$, and $R(n_j, m_{j+1}) = n_{j+1}$ for every $j = 0, \ldots, k-1$.

2 For all $n, n' \in N$, $m \in M$, $R(n, m) = n' \iff R(n', \overline{m}) = n$;
for all $n \in N$ except one (the root), exactly one of $R(n, \triangle)$ or
$R(n, \triangleleft)$ is defined; for the root, neither $R(n, \triangle)$ nor
$R(n, \triangleleft)$ is defined.
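
$\phi$ ::=                                                                      formula
      $\top \mid \bot$                                                          true, false
    | $p \mid \neg p$                                                           atomic prop (negated)
    | $x$                                                                       recursion variable
    | $\phi \vee \phi$                                                          disjunction
    | $\phi \wedge \phi$                                                        conjunction
    | $\langle m \rangle \phi \mid \neg\langle m \rangle \top$                  modality (negated)
    | $\langle\alpha\rangle_{\le k}\,\psi \mid \langle\alpha\rangle_{>k}\,\psi$ counting
    | $\mu x.\psi$                                                              fixpoint operator

$\psi$ ::= $\top \mid \bot \mid p \mid \neg p \mid x \mid \psi \vee \psi \mid \psi \wedge \psi \mid \langle m \rangle \psi \mid \neg\langle m \rangle \top \mid \mu x.\psi$

Figure 2: Syntax of Formulas (in Normal Form).

    $\neg\langle m \rangle \phi \equiv \neg\langle m \rangle \top \vee \langle m \rangle \neg\phi$
    $\neg\mu x.\psi \equiv \mu x.\neg\psi\{x / \neg x\}$

Figure 3: Reduction to Negation Normal Form.
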
2.3 Syntax of Logical Formulas 

The syntax of logical formulas is given in Figure 2, where $m \in M$ and
$k \in \mathbb{N}$. Formulas written $\phi$ may contain counting subformulas,
whereas formulas written $\psi$ cannot. We thus disallow counting under counting
or under fixpoints. We also restrict formulas to cycle-free formulas, as detailed
in Section 2.5. The syntax is shown in negation normal form. The negation of any
closed formula (i.e., with no free variable) built using the syntax of Figure 2
may be transformed into negation normal form using the usual De Morgan rules
together with the rules given in Figure 3. When we write $\neg\phi$, we mean its
negation normal form.

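Defining an equality operator for counting formulas is straightforward using
the other counting operators.

    $\langle\alpha\rangle_{=k}\,\psi \equiv \langle\alpha\rangle_{>(k-1)}\,\psi \wedge \langle\alpha\rangle_{\le k}\,\psi$   if $k > 0$

    $\langle\alpha\rangle_{=0}\,\psi \equiv \langle\alpha\rangle_{\le 0}\,\psi$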
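
[Figure 4 gives the interpretation $\llbracket\cdot\rrbracket^T_V$ of each
construct: $\top$, $\bot$, $p$, $\neg p$, $x$, $\phi_1 \vee \phi_2$,
$\phi_1 \wedge \phi_2$, $\langle m \rangle\phi$, $\neg\langle m \rangle\top$,
$\langle\alpha\rangle_{\le k}\,\psi$, $\langle\alpha\rangle_{>k}\,\psi$, and
$\mu x.\psi$.]
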
Figure 4: Semantics of Formulas. 



2.4 Semantics of Logical Formulas 

A formula is interpreted as a set of nodes in a tree. A model of a formula is
a tree such that the formula denotes a non-empty set of nodes in this tree. A
counting formula $\langle\alpha\rangle_{>k}\,\psi$ satisfied at a given node $n$
means that there are at least $k + 1$ nodes satisfying $\psi$ that can be reached
from $n$ through the trail $\alpha$. A counting formula
$\langle\alpha\rangle_{>k}\,\psi$ is thus interpreted as the set of nodes such
that, for each of them, the previously described condition holds. For example,
the formula $p_1 \wedge \langle\triangledown\rangle\langle\triangleright^*\rangle_{>5}\, p_2$
denotes $p_1$ nodes with strictly more than 5 children nodes named $p_2$.
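
The following Python sketch illustrates how a counting formula can be evaluated
on one given tree. It is an assumption-laden illustration rather than the
algorithm of this report: the tree is stored as a dictionary playing the role of
the partial map R, the modality names `DOWN`, `RIGHT`, `UP`, `LEFT` are
hypothetical stand-ins for the four modalities, and the example formula is read
as counting along the composite trail "first child, then any number of next
siblings".

```python
from typing import Dict, List, Set, Tuple

# Hypothetical names for the four modalities of the binary encoding.
DOWN, RIGHT, UP, LEFT = "down", "right", "up", "left"

Edges = Dict[Tuple[int, str], int]  # the partial map R
# A trail is a list of (alternatives, starred) pairs; each alternative is a
# tuple of modalities, e.g. ((DOWN,),) or ((UP,), (LEFT,)).
Trail = List[Tuple[Tuple[Tuple[str, ...], ...], bool]]

def add_children(R: Edges, parent: int, kids: List[int]) -> None:
    """Record an ordered list of children in first-child/next-sibling form."""
    if not kids:
        return
    R[(parent, DOWN)], R[(kids[0], UP)] = kids[0], parent
    for a, b in zip(kids, kids[1:]):
        R[(a, RIGHT)], R[(b, LEFT)] = b, a

def follow(R: Edges, nodes: Set[int], alternatives) -> Set[int]:
    """One pass through a repetition-free subtrail given as alternatives."""
    out: Set[int] = set()
    for path in alternatives:
        ends = nodes
        for m in path:
            ends = {R[(n, m)] for n in ends if (n, m) in R}
        out |= ends
    return out

def reach(R: Edges, start: int, trail: Trail) -> Set[int]:
    """Nodes that can be reached from `start` through the trail."""
    current = {start}
    for alternatives, starred in trail:
        if not starred:
            current = follow(R, current, alternatives)
            continue
        seen, frontier = set(current), current
        while frontier:  # zero or more repetitions of the subtrail
            frontier = follow(R, frontier, alternatives) - seen
            seen |= frontier
        current = seen
    return current

def count_gt(R: Edges, L: Dict[int, str], n: int, trail: Trail,
             k: int, label: str) -> bool:
    """Does node n have more than k nodes labeled `label` along the trail?"""
    return sum(1 for t in reach(R, n, trail) if L[t] == label) > k

# Example: node 0 (labeled p1) with six children 1..6, all labeled p2.
R: Edges = {}
add_children(R, 0, [1, 2, 3, 4, 5, 6])
L = {0: "p1", **{i: "p2" for i in range(1, 7)}}
CHILDREN: Trail = [(((DOWN,),), False), (((RIGHT,),), True)]
EVERYWHERE: Trail = [(((UP,), (LEFT,)), True), (((DOWN,), (RIGHT,)), True)]
```

Here `L[0] == "p1" and count_gt(R, L, 0, CHILDREN, 5, "p2")` plays the role of
the example formula at node 0, and `EVERYWHERE` is a trail that goes backward
any number of times and then forward any number of times.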

In order to present the formal semantics of formulas, we introduce valuations,
written $V$, which relate variables to sets of nodes. We write $V[N'/x]$, where
$N'$ is a subset of the nodes, for the valuation defined as
$V[N'/x](y) = V(y)$ if $x \neq y$, and $V[N'/x](x) = N'$. Given a tree
$T = (N, R, L)$ and a valuation $V$, the formal semantics of formulas is given
in Figure 4.

Note that the function $f : Y \mapsto \llbracket\psi\rrbracket^T_{V[Y/x]}$ is
monotone, and the denotation of $\mu x.\psi$ is a fixed point [26].

Intuitively, propositions denote the nodes where they occur; negation is in- 
terpreted as set complement; disjunction and conjunction are respectively set 
union and intersection; the least fixpoint operator performs finite recursive nav- 
igation; and the counting operator denotes nodes such that the ones accessible 
from this node through a trail fulfill a cardinality restriction. A logical formula 
is said to be satisfiable iff it has a model, i.e., there exists a tree for which the
semantics of the formula is not empty. 
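
The intuitive reading above can be written out as equations. The block below is
a hedged sketch assembled from the prose of this section and from standard
conventions for modal fixpoint logics; it is not a transcription of Figure 4,
and the notation $\llbracket\cdot\rrbracket^T_V$ and
$n \xrightarrow{\alpha} n'$ is assumed.

```latex
\begin{align*}
\llbracket \top \rrbracket^T_V &= N
  & \llbracket \bot \rrbracket^T_V &= \emptyset \\
\llbracket p \rrbracket^T_V &= \{ n \mid L(n) = p \}
  & \llbracket \neg p \rrbracket^T_V &= \{ n \mid L(n) \neq p \} \\
\llbracket x \rrbracket^T_V &= V(x) \\
\llbracket \phi_1 \vee \phi_2 \rrbracket^T_V
  &= \llbracket \phi_1 \rrbracket^T_V \cup \llbracket \phi_2 \rrbracket^T_V
  & \llbracket \phi_1 \wedge \phi_2 \rrbracket^T_V
  &= \llbracket \phi_1 \rrbracket^T_V \cap \llbracket \phi_2 \rrbracket^T_V \\
\llbracket \langle m \rangle \phi \rrbracket^T_V
  &= \{ n \mid R(n, m) \in \llbracket \phi \rrbracket^T_V \}
  & \llbracket \neg\langle m \rangle \top \rrbracket^T_V
  &= \{ n \mid R(n, m) \text{ is undefined} \} \\
\llbracket \langle\alpha\rangle_{\le k}\, \psi \rrbracket^T_V
  &= \{ n \mid |\{ n' \in \llbracket \psi \rrbracket^T_V \mid n \xrightarrow{\alpha} n' \}| \le k \} \\
\llbracket \langle\alpha\rangle_{> k}\, \psi \rrbracket^T_V
  &= \{ n \mid |\{ n' \in \llbracket \psi \rrbracket^T_V \mid n \xrightarrow{\alpha} n' \}| > k \} \\
\llbracket \mu x.\psi \rrbracket^T_V
  &= \bigcap \{ N' \subseteq N \mid \llbracket \psi \rrbracket^T_{V[N'/x]} \subseteq N' \}
\end{align*}
```

With these equations, a formula is satisfiable exactly when some tree $T$ makes
its interpretation non-empty.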
2.5 Cycle-Freeness

A formal definition of cycle-freeness can be found in [10]. Intuitively, in a
cycle-free formula, fixpoint variables must occur under a modality but cannot
occur in the scope of both a modality and its converse. For instance, the formula
$\mu x.\langle\triangledown\rangle x \vee \langle\triangle\rangle x$ is not
cycle-free. In a cycle-free formula, the number of modality cycles (of the form
$m\,\overline{m}$) is bounded independently of the number of times fixpoints
are unfolded (i.e., by replacing a fixpoint variable with the fixpoint itself). A
fundamental consequence of the restriction to cycle-free formulas is that, when
considering only finite trees, the interpretations of the greatest and smallest
fixpoints coincide. This greatly simplifies the logic.

Here, we also restrict our approach to cycle-free formulas. We thus need to
extend this notion to the counting operators, and more precisely to the trails
that occur in them. Cycle-free trails are trails where a subtrail and its
converse do not both occur under the scope of the recursion operator. We thus
restrict the formulas under consideration to cycle-free formulas whose counting
operators contain cycle-free trails.

Lemma 2.1. Let $\phi$ be a cycle-free formula, and $T$ be a tree for which
$\llbracket\phi\rrbracket^T_\emptyset \neq \emptyset$. Then there is a finite
unfolding $\phi'$ of the fixpoints of $\phi$ such that
$\llbracket\phi'\{\bot / \mu x.\psi\}\rrbracket^T_\emptyset = \llbracket\phi\rrbracket^T_\emptyset$.

Proof. As cycle-free counting formulas may be translated into (exponentially
larger) cycle-free non-counting formulas, the proof is identical to the one in
[10]. □

As a consequence, our logic is closed under negation even without greatest 
fixpoints. 

2.6 Global Counting Formulas and Nominals 

To conclude this section, we turn to an illustration of the expressive power of
our logic. An interesting consequence of the inclusion of backward axes in trails
is the ability to reach every node in the tree from any node of the tree, using
the trail $(\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^*$.³
We can thus select some nodes depending on some global counting property.
Consider the following formula, where $\#$ stands for one of the comparison
operators $\le$, $>$, or $=$.

    $\langle (\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^* \rangle_{\# k}\, \phi_1$

Intuitively, this formula counts how many nodes in the whole tree satisfy $\phi_1$.
For each node of the tree, it selects it if and only if the count is compatible with
the comparison considered. The interpretation of this formula is thus either
every node of the tree, or none. It is then easy to restrict the selected nodes to
some that satisfy another formula $\phi_2$, using intersection.

    $(\langle (\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^* \rangle_{\# k}\, \phi_1) \wedge \phi_2$

3 Note that this trail is cycle-free.
This formula selects every node satisfying $\phi_2$ if and only if there are
$\# k$ nodes satisfying $\phi_1$, which we write as follows.

    $\phi_1 \# k \Longrightarrow \phi_2$

We can now express existential properties, such as "select every node satisfying
$\phi_2$ if there exists a node satisfying $\phi_1$",

    $\phi_1 > 0 \Longrightarrow \phi_2$

We can also express universal properties, such as "select every node satisfying
$\phi_2$ if every node satisfies $\phi_1$".

    $(\neg\phi_1) \le 0 \Longrightarrow \phi_2$

Another way to interpret global counting formulas is as a generalization 
of the so-called nominals in the modal logics community [21]. Nominals are 
special propositions whose interpretation is a singleton (they occur exactly once 
in the model). They come for free with the logic. A nominal, denoted "@n", 
corresponds to the following global counting formula: 

    $[\langle (\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^* \rangle_{=1}\, n] \wedge n$

where n is a new fresh atomic proposition. 

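One may need nominals to occur in the scope of counting formulas. As we
disallow counting under counting, we propose the following alternative encoding
of nominals in these cases:

    $@n \equiv n \wedge \neg[\text{descendant}(n) \vee \text{ancestor}(n) \vee \text{anc-or-self}(\text{siblings}(\text{desc-or-self}(n)))]$,

where:

    descendant($\psi$)   = $\langle\triangledown\rangle \mu x.\psi \vee \langle\triangledown\rangle x \vee \langle\triangleright\rangle x$;
    foll-sibling($\psi$) = $\mu x.\langle\triangleright\rangle\psi \vee \langle\triangleright\rangle x$;
    prec-sibling($\psi$) = $\mu x.\langle\triangleleft\rangle\psi \vee \langle\triangleleft\rangle x$;
    desc-or-self($\psi$) = $\mu x.\psi \vee \langle\triangledown\rangle \mu y.x \vee \langle\triangleright\rangle y$;
    ancestor($\psi$)     = $\mu x.\langle\triangle\rangle(\psi \vee x) \vee \langle\triangleleft\rangle x$;
    anc-or-self($\psi$)  = $\mu x.\psi \vee \mu y.\langle\triangle\rangle(y \vee x) \vee \langle\triangleleft\rangle y$;
    siblings($\psi$)     = foll-sibling($\psi$) $\vee$ prec-sibling($\psi$).
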



3 Application to XML Trees 
3.1 XPath Expressions 

XPath [3] was introduced as part of the W3C XSLT transformation language to
have a non-XML format for selecting nodes and computing values from an XML
document (see [10] for a formal presentation of XPath). Since then, XPath has
become part of several other standards; in particular, it forms the "navigation
subset" of the XQuery language.

In their simplest form XPath expressions look like "directory navigation
paths". For example, the XPath

/company/personnel/employee

navigates from the root of a document through the top-level "company" node 
to its "personnel" child nodes and on to its "employee" child nodes. The result 
of the evaluation of the entire expression is the set of all the "employee" nodes 
that can be reached in this manner. At each step in the navigation, the selected 
nodes for that step can be filtered with a predicate test. Of special interest to 
us are the predicates that count nodes or that test the position of the selected 
node in the previous step's selection. For example, if we ask for 

/company/personnel/employee[position()=2]

then the result is all employee nodes that are the second employee node (in doc- 
ument order) among the employee child nodes of each personnel node selected 
by the previous step. 

XPath also makes it possible to combine the capability of searching along
"axes" other than the shown "children of" with counting constraints. For
example, if we ask for

/company[count(descendant::employee)<=300]/name

then the result consists of the names of companies with at most 300 employees in
total (the axis "descendant" is the transitive closure of the default - and often
omitted - axis "child").

The syntax and semantics of Core XPath expressions are respectively given
in Figure 5 and Figure 6. An XPath expression is interpreted as a relation
between nodes. The considered XPath fragment allows absolute and relative paths,
path union, intersection, composition, as well as node tests and qualifiers with
counting operators, conjunction, disjunction, negation, and path navigation.
Furthermore, it supports all XPath axes allowing multidirectional navigation.

It was already observed in [11, 27] that using positional information in paths
reduces to counting (at the cost of an exponential blow-up). For example, the
expression

child::a[position()=5]

first selects the "a" nodes occurring as children of the current context node, and
then keeps those occurring at the 5th position. This expression can be rewritten
into the semantically equivalent expression:

child::a[count(preceding-sibling::a)=4]

Axis       ::= self | child | parent | descendant | ancestor |
               following-sibling | preceding-sibling |
               following | preceding
NameTest   ::= QName | *
Step       ::= Axis::NameTest
PathExpr   ::= PathExpr/PathExpr | PathExpr[Qualifier] | Step
Qualifier  ::= PathExpr | CountExpr | not Qualifier |
               Qualifier and Qualifier | Qualifier or Qualifier | @n
CountExpr  ::= count(PathExpr') Comp k
PathExpr'  ::= PathExpr'/PathExpr' | PathExpr'[Qualifier'] | Step
Qualifier' ::= PathExpr' | not Qualifier' | Qualifier' and Qualifier'
               | Qualifier' or Qualifier' | @n
Comp       ::= ≤ | > | ≥ | < | =
XPath      ::= PathExpr | /PathExpr | XPath union PathExpr |
               XPath intersect PathExpr | XPath except PathExpr

Figure 5: Syntax of Core XPath Expressions.



which constrains the number of preceding siblings named "a" to 4, so that
the qualifier becomes true only for the 5th child "a". A general translation of
positional information in terms of counting operators [11, 27] is summarized in
Figure 7, where ≪ denotes the document order (depth-first left-to-right) relation
in a tree. Note that translated path expressions can in turn be expressed in the
core XPath fragment of Figure 5 (at the cost of another exponential blow-up).
Indeed, expressions like PathExpr₁/(PathExpr₂ except PathExpr₃)/PathExpr₄
must be rewritten into expressions where binary connectives for paths occur
only at top level, as in:

PathExpr₁/PathExpr₂/PathExpr₄ except
PathExpr₁/PathExpr₃/PathExpr₄
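
The rewriting of position()=5 into a count over preceding siblings, discussed
above, can be checked on a small document. The sketch below uses Python's
standard `xml.etree.ElementTree` only to parse the document; both selections
are computed by hand, and the function names and the sample document are
illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def by_position(context, name, k):
    """child::name[position()=k]: the k-th child named `name` (1-based)."""
    named = [child for child in context if child.tag == name]
    return named[k - 1:k]

def by_count(context, name, k):
    """child::name[count(preceding-sibling::name)=k-1]."""
    children = list(context)
    selected = []
    for i, child in enumerate(children):
        if child.tag != name:
            continue
        preceding = sum(1 for sibling in children[:i] if sibling.tag == name)
        if preceding == k - 1:
            selected.append(child)
    return selected

doc = ET.fromstring("<r><a/><b/><a/><a/><b/><a/><a/><a/></r>")
```

On this document both functions select the same single node for `k = 5`, and
both return an empty selection when there are fewer than `k` children named
"a".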

We focus on Core XPath expressions involving the counting operator (see
Figure 5). The XPath fragment without the counting operator (the navigational
fragment) was already linearly translated into μ-calculus in [10]. The
contributions presented in this paper make it possible to equip this navigational
fragment with counting features such as the ones formulated above. Logical
formulas capture the aforementioned XPath counting constraints. For example,
consider the following XPath expression:

child::a[count(descendant::b[parent::c])>5]






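$\llbracket$Axis::NameTest$\rrbracket$ = $\{(x, y) \in N^2 \mid x(\text{Axis})y$ and $y$ satisfies NameTest$\}$
$\llbracket$/PathExpr$\rrbracket$ = $\{(r, y) \in \llbracket\text{PathExpr}\rrbracket \mid r$ is the root$\}$
$\llbracket P_1/P_2 \rrbracket$ = $\llbracket P_1 \rrbracket \circ \llbracket P_2 \rrbracket$
$\llbracket P_1$ union $P_2 \rrbracket$ = $\llbracket P_1 \rrbracket \cup \llbracket P_2 \rrbracket$
$\llbracket P_1$ intersect $P_2 \rrbracket$ = $\llbracket P_1 \rrbracket \cap \llbracket P_2 \rrbracket$
$\llbracket P_1$ except $P_2 \rrbracket$ = $\llbracket P_1 \rrbracket \setminus \llbracket P_2 \rrbracket$
$\llbracket$PathExpr[Qualifier]$\rrbracket$ = $\{(x, y) \in \llbracket\text{PathExpr}\rrbracket \mid y \in \llbracket\text{Qualifier}\rrbracket_{Qualif}\}$
$\llbracket$PathExpr$\rrbracket_{Qualif}$ = $\{x \mid \exists y.\,(x, y) \in \llbracket\text{PathExpr}\rrbracket\}$
$\llbracket$count(PathExpr) Comp $k\rrbracket_{Qualif}$ = $\{x \in N \mid |\{y \mid (x, y) \in \llbracket\text{PathExpr}\rrbracket\}|$ satisfies Comp $k\}$
$\llbracket$not $Q\rrbracket_{Qualif}$ = $N \setminus \llbracket Q \rrbracket_{Qualif}$
$\llbracket Q_1$ and $Q_2\rrbracket_{Qualif}$ = $\llbracket Q_1 \rrbracket_{Qualif} \cap \llbracket Q_2 \rrbracket_{Qualif}$
$\llbracket Q_1$ or $Q_2\rrbracket_{Qualif}$ = $\llbracket Q_1 \rrbracket_{Qualif} \cup \llbracket Q_2 \rrbracket_{Qualif}$

Figure 6: Semantics of Core XPath Expressions

PathExpr[position() = 1]     ≡ PathExpr except (PathExpr/≪)
PathExpr[position() = k + 1] ≡ (PathExpr intersect (PathExpr[k]/≪))[position() = 1]
≪                            ≡ (descendant::*) union (a-or-s::*/following-sibling::*/d-or-s::*)
a-or-s::*                    ≡ ancestor::* union self::*
d-or-s::*                    ≡ descendant::* union self::*

Figure 7: Positional Information as Syntactic Sugars [11, 27]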
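
Path                           Logical formula
$\gamma$/self::*               $\gamma$
$\gamma$/child::*              $\langle \triangleleft^*, \triangle \rangle \gamma$
$\gamma$/parent::*             $\langle\triangledown\rangle \langle\triangleright^*\rangle \gamma$
$\gamma$/descendant::*         $\langle (\triangleleft \mid \triangle)^*, \triangle \rangle \gamma$
$\gamma$/ancestor::*           $\langle\triangledown\rangle \langle (\triangledown \mid \triangleright)^* \rangle \gamma$
$\gamma$/following-sibling::*  $\langle\triangleleft\rangle \langle\triangleleft^*\rangle \gamma$
$\gamma$/preceding-sibling::*  $\langle\triangleright\rangle \langle\triangleright^*\rangle \gamma$

Figure 8: XPath axes as modalities over binary trees.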

This expression selects the children nodes named "a" provided they have more
than 5 descendants which (1) are named "b" and (2) whose parent is named
"c". The logical formula denoting the set of children nodes named "a" is:

    $\psi = a \wedge \langle \triangleleft^*, \triangle \rangle \top$

The logical translation of the above XPath expression is:

    $\psi \wedge \langle\triangledown\rangle \langle (\triangledown \mid \triangleright)^* \rangle_{>5}\, (b \wedge \mu x.\langle\triangle\rangle c \vee \langle\triangleleft\rangle x)$

This formula holds for nodes selected by the XPath expression. A correspondence
between the main XPath axes over unranked trees and modal formulas
over binary trees is given in Figure 8. In this figure, each logical formula holds
for nodes selected by the corresponding XPath axis from a context $\gamma$.
Let us consider another example (XPath expression $e_1$):

child::a/child::b[count(child::e/descendant::h)>3]

Starting from a given context in a tree, this XPath expression navigates to
children nodes named "a" and selects their children named "b". Finally, it
retains only those "b" nodes for which the qualifier between brackets holds.
The first path can be translated in the logic as follows:

    $\vartheta = b \wedge \mu x.\langle\triangle\rangle(a \wedge \mu x'.\langle\triangle\rangle\gamma \vee \langle\triangleleft\rangle x') \vee \langle\triangleleft\rangle x$

The counting part requires a more sophisticated translation in the logic. This
is because it makes implicit that "e" nodes (whose existence is simply tested for
counting purposes) must be children of selected "b" nodes. The translation of
the full aforementioned XPath expression is as follows:

    $\vartheta \wedge @n \wedge \langle (\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^* \rangle_{>3}\, \eta$

where $@n$ is a new fresh nominal used to mark a "b" node which is filtered by
the qualifier and the formula $\eta$ describes the counted "h" nodes:

    $\eta = h \wedge \mu x.\langle\triangle\rangle(e \wedge \mu x'.\langle\triangle\rangle @n \vee \langle\triangleleft\rangle x') \vee \langle\triangleleft\rangle x \vee \langle\triangle\rangle x$
Intuitively, the general idea behind the translation is to first translate the leading
path, use a fresh nominal for marking a node which is filtered, then find more
than 3 instances of "h" nodes from which we can reach back the marked node via
the inverse path of the counting formula.

Since trails make it possible to navigate but not to test properties (like
existence of labels), we test for labels in the counted formula $\eta$ and we use a
general navigation $(\triangle \mid \triangleleft)^*, (\triangledown \mid \triangleright)^*$
to look for counted nodes everywhere in the tree. Introducing the nominal is
necessary to bind the context properly (without loss of information). Indeed,
the XPath expression $e_1$ makes implicit that an "e" node must be a child of a
"b" node selected by the outer path. Using a nominal, we restore this property
by connecting the counted nodes to the initial single context node.

Lemma 3.1. The translation of Core XPath expressions with counting con- 
straints into the logic is linear. 

It is proven by structural induction in a similar manner to [10] (in which the
translation is proven for expressions without counting constraints). For counting
formulas, the use of nominals and the general (constant-size) counting trail make
it possible to avoid duplication of trails so that the translation remains linear.

We can now address several decision problems such as equivalence, containment,
and emptiness of XPath expressions. These decision problems are
reduced to testing satisfiability for the logic (in the manner of [10]). We present
in Section 4 a satisfiability testing algorithm with a single exponential
complexity with respect to the formula size.

In [10], it was shown that the logic is also able to capture XML schema languages.
This makes it possible to test the XPath decision problems in the presence of XML
types. We now show that our logic can also succinctly express cardinality
constraints on XML types.

3.2 Regular Tree Languages with Cardinality Constraints 

Regular tree grammars capture most of the schemas in use today [23]. The logic
can express all regular tree languages (it is easy to prove that regular expression
types in the manner of e.g., [14] can be linearly translated into the logic: see
[10]).

In practice, schema languages often provide shorthands for expressing car- 
dinality constraints on node occurrences. XML Schema notably offers two at- 
tributes minOccurs and maxOccurs for this purpose. For instance, the following 
XML schema definition: 

<xsd:element name="a">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="b" minOccurs="4" maxOccurs="9"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
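is a notation that restricts the number of occurrences of "b" nodes to be at
least 4 and at most 9, as children of "a" nodes. The goal here is to have
a succinct notation for expressing regular languages which could otherwise be
exponentially large if written with usual regular expression operators. The above
regular requirement can be translated as the formula:

    $\phi \wedge \langle\triangledown\rangle(\langle\triangleright^*\rangle_{>3}\, b \wedge \langle\triangleright^*\rangle_{\le 9}\, b)$

where $\phi$ corresponds to the regular tree type a[b*] as follows:

    $\phi = (a \wedge (\neg\langle\triangledown\rangle\top \vee \langle\triangledown\rangle\psi)) \wedge \neg\langle\triangleright\rangle\top$

    $\psi = \mu x.(b \wedge \neg\langle\triangledown\rangle\top \wedge \neg\langle\triangleright\rangle\top) \vee (b \wedge \neg\langle\triangledown\rangle\top \wedge \langle\triangleright\rangle x)$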
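
The size argument behind minOccurs and maxOccurs can be illustrated on the
content model alone. The sketch below (illustrative helper names; `symbol` is
assumed to be a single plain character) expands a bounded repetition into a
pattern that uses only concatenation and optional groups, and compares it with
the counted form on strings of "b".

```python
import re

def expand(symbol: str, lo: int, hi: int) -> str:
    """Counting-free pattern for symbol{lo,hi}: concatenation and '?' only."""
    optional = ""
    for _ in range(hi - lo):
        optional = f"(?:{symbol}{optional})?"
    return symbol * lo + optional

COUNTED = "b{4,9}"             # succinct form, bounds written as numbers
EXPANDED = expand("b", 4, 9)   # same language without a counting operator

def accepts(pattern: str, n: int) -> bool:
    """Does the pattern accept the string made of n occurrences of 'b'?"""
    return re.fullmatch(pattern, "b" * n) is not None
```

The expanded pattern grows linearly with the value of the upper bound, that is,
exponentially with the number of digits used to write it in the counted form.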

This example only involves counting over children nodes. The logic allows 
counting through more general trails, and in particular arbitrarily deep trails. 
Trails corresponding to the XPath axes "preceding, ancestor, following" can be 
used to constrain the context of a schema. The "descendant" trail can be used 
to specify additional constraints over the subtree defined by a given schema. 
For instance, suppose we want to forbid webpages containing nested anchors
"a" (whose interpretation makes no sense for web browsers). We can build the
logical formula $f$ which is the conjunction of a considered schema for webpages
(e.g. XHTML) with the formula a/descendant::a in XPath notation. Nested
anchors are forbidden by the considered schema iff $f$ is unsatisfiable.

As another example, suppose we want paragraph nodes ("p" nodes) not to be
nested inside more than 3 unordered lists ("ul" nodes), regardless of the schema
defining the context. One may check for the unsatisfiability of the following
formula:

    $p \wedge \langle (\triangle \mid \triangleleft)^*, \triangle \rangle_{>3}\, ul$
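
For intuition, the same constraint can be checked on one concrete document. The
Python sketch below walks a parsed tree and reports the "p" elements that have
more than 3 "ul" ancestors; the function name and the `limit` parameter are
illustrative assumptions. Unlike the satisfiability test described in the text,
which reasons about every tree at once, this only inspects a single document.

```python
import xml.etree.ElementTree as ET

def overnested_paragraphs(root, limit=3):
    """Return the <p> elements that have more than `limit` <ul> ancestors."""
    found = []

    def visit(elem, ul_ancestors):
        if elem.tag == "p" and ul_ancestors > limit:
            found.append(elem)
        below = ul_ancestors + (1 if elem.tag == "ul" else 0)
        for child in elem:
            visit(child, below)

    visit(root, 0)
    return found
```

A document is a counterexample to the constraint exactly when this function
returns a non-empty list for it.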

4 Satisfiability Algorithm 

We present a tableau-based algorithm for checking satisfiability of formulas. 
Given a formula, the algorithm seeks to build a tree containing a node selected 
by the formula. We show that our algorithm is correct and complete: a satisfying 
tree is found if and only if the formula is satisfiable. We also show that the time 
complexity of our algorithm is exponential in the size of the formula. 

4.1 Overview 

The algorithm operates in two stages. 

First, a formula $\phi$ is decomposed into a set of subformulas, called the lean.
The lean gathers all subformulas that are useful for determining the truth status
of the initial formula, while eliminating redundancies. For instance, conjunctions
and disjunctions are eliminated at this stage. More precisely, the lean (defined
in Section 4.2) mainly gathers atomic propositions and modal subformulas. From the
lean, one may gather a finite number of formulas, called a $\phi$-node, which may
be satisfied at a given node of a tree. Trees of $\phi$-nodes represent the exhaustive
search universe in which the algorithm is looking for a satisfying tree.

The second stage of the algorithm consists in the building of sets of such
trees in a bottom-up manner, ensuring consistency at each step. Initially, all
possible leaves (i.e., $\phi$-nodes that do not require children nodes) are considered.
During further steps, the algorithm considers every possible $\phi$-node that can
be connected with a tree of the previous steps, checking for consistency. For
instance, if a formula at a $\phi$-node $n$ involves a forward modality
$\langle\triangledown\rangle\phi'$, then $\phi'$ must be verified at the first child
of $n$. Reciprocally, due to converse modalities, a $\phi$-node may impose
restrictions on its possible parent nodes. The new trees that are built may
involve converse modalities, which will be satisfied during further steps of the
algorithm. To ensure the algorithm terminates, a bound on the number of times
each $\phi$-node may occur in the tree is given.

Finally, the algorithm terminates whenever: 

• either a tree that satisfies the initial formula has been found, and its root 
does not contain any pending (unproven) backward modality; or 

• every tree has been considered (the exploration of the whole search
universe is complete): the formula is unsatisfiable.

4.2 Preliminaries 

To track where counting formulas are satisfied, we annotate each one with a
fresh counting proposition $c$, yielding formulas of the form
$\langle\alpha\rangle^{c}_{\# k}\,\phi$. To define the notions of lean and
$\phi$-nodes, we need to extract navigating formulas from counting formulas
(Figure 9).

We now define the Fischer-Ladner relation to extract subformulas. In the
following, $i$ ranges over 1 and 2.

$$R^{fl}(\phi_1 \wedge \phi_2,\ \phi_i), \qquad R^{fl}(\phi_1 \vee \phi_2,\ \phi_i),$$

$$R^{fl}(\mu x.\phi,\ \phi[\mu x.\phi / x]), \qquad R^{fl}(\langle\alpha\rangle^{c}_{\# k}\,\psi,\ nav(\langle\alpha\rangle^{c}_{\# k}\,\psi)),$$

$$R^{fl}(\langle m\rangle\phi,\ \phi).$$

The Fischer-Ladner closure of a formula $\phi$, written $FL(\phi)$, is the set
defined as follows.

$$FL(\phi)_0 = \{\phi\},$$

$$FL(\phi)_{i+1} = FL(\phi)_i \cup \{\phi' \mid R^{fl}(\phi'', \phi'),\ \phi'' \in FL(\phi)_i\},$$

$$FL(\phi) = FL(\phi)_k,$$

where $k$ is the smallest integer s.t. $FL(\phi)_k = FL(\phi)_{k+1}$. Note that this set is
finite since only one expansion of a fixpoint formula is required in order to
produce all its subformulas in the closure.
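
The closure computation above is a plain fixpoint iteration. The following Python sketch illustrates it under stated assumptions: the tuple encoding of formulas and the names `substitute`, `fl_step` and `fl_closure` are hypothetical and not taken from the paper, and counting formulas (which would go through nav) are left out for brevity.

```python
# Assumed encoding (illustrative only):
#   ("prop", "p"), ("top",), ("var", "x"), ("and", f, g), ("or", f, g),
#   ("mod", m, f) with m in {"down", "right", "up", "left"}, ("mu", "x", f)

def substitute(f, var, repl):
    """Replace free occurrences of variable `var` in `f` by `repl`."""
    tag = f[0]
    if tag == "var":
        return repl if f[1] == var else f
    if tag in ("and", "or"):
        return (tag, substitute(f[1], var, repl), substitute(f[2], var, repl))
    if tag == "mod":
        return ("mod", f[1], substitute(f[2], var, repl))
    if tag == "mu":
        # an inner binder with the same name shadows `var`
        return f if f[1] == var else ("mu", f[1], substitute(f[2], var, repl))
    return f  # "prop" and "top" contain no variables


def fl_step(f):
    """Formulas related to `f` by one step of the relation R^fl."""
    tag = f[0]
    if tag in ("and", "or"):
        return {f[1], f[2]}
    if tag == "mu":
        return {substitute(f[2], f[1], f)}  # a single unfolding
    if tag == "mod":
        return {f[2]}
    return set()


def fl_closure(phi):
    """Iterate FL_0 = {phi}, FL_{i+1} = FL_i plus one R^fl step, to a fixpoint."""
    closure = {phi}
    while True:
        new = set().union(*(fl_step(f) for f in closure)) - closure
        if not new:
            return closure
        closure |= new
```

On a formula such as $\mu x.\, p \vee \langle\triangleright\rangle x$ the loop stabilizes after a single unfolding, which matches the finiteness remark above.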

The lean of a formula $\phi$ is a set of formulas containing navigating formulas
of the form $\langle m\rangle\top$, every navigating formula of the form $\langle m\rangle\psi$ (i.e., that do
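$$\begin{aligned}
nav(x) &= x & nav(p) &= p \\
nav(\top) &= \top & nav(c) &= c \\
nav(\neg p) &= \neg p & nav(\neg\langle m\rangle\top) &= \neg\langle m\rangle\top
\end{aligned}$$

$$\begin{aligned}
nav(\phi_1 \wedge \phi_2) &= nav(\phi_1) \wedge nav(\phi_2) \\
nav(\phi_1 \vee \phi_2) &= nav(\phi_1) \vee nav(\phi_2) \\
nav(\langle m\rangle\phi) &= \langle m\rangle nav(\phi) \\
nav(\mu x.\psi) &= \mu x.\, nav(\psi) \\
nav(\langle\alpha\rangle^{c}_{>k}\,\psi) &= nav(\langle\alpha\rangle,\ \psi \wedge c) \\
nav(\langle\alpha\rangle^{c}_{\leq k}\,\psi) &= nav(\langle\alpha\rangle,\ (\psi \wedge c) \vee (\neg\psi \wedge \neg c)) \\
nav(\langle\epsilon\rangle, \psi) &= \psi \\
nav(\langle m\rangle, \psi) &= \langle m\rangle\psi \\
nav(\langle\alpha_1, \alpha_2\rangle, \psi) &= nav(\langle\alpha_1\rangle,\ nav(\langle\alpha_2\rangle, \psi)) \\
nav(\langle\alpha_1 \mid \alpha_2\rangle, \psi) &= nav(\langle\alpha_1\rangle, \psi) \vee nav(\langle\alpha_2\rangle, \psi) \\
nav(\langle\alpha^*\rangle, \psi) &= \mu x.\, nav(\psi) \vee nav(\langle\alpha\rangle, x)
\end{aligned}$$

Figure 9: Navigation extraction from counting formulas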
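
As a rough illustration of how the trail cases of Figure 9 unfold, here is a small Python sketch. The tuple encoding of trails and formulas, the function names, and the `fresh` iterator of variable names are all assumptions made for this example. The sketch also assumes that the counted formula is free of counting, so that nav leaves it unchanged, and it takes the negation normal form of that formula as a precomputed argument.

```python
# Assumed trail encoding (illustrative only):
#   ("eps",), ("step", m), ("seq", a1, a2), ("alt", a1, a2), ("star", a)
# Formulas use ("prop", p), ("not", f), ("and", f, g), ("or", f, g),
#   ("mod", m, f), ("var", x), ("mu", x, f).

def nav_trail(alpha, psi, fresh):
    """Navigating formula for trail `alpha` ending in `psi`.

    `fresh` is an iterator yielding unused fixpoint variable names.
    """
    tag = alpha[0]
    if tag == "eps":
        return psi
    if tag == "step":
        return ("mod", alpha[1], psi)
    if tag == "seq":
        inner = nav_trail(alpha[2], psi, fresh)
        return nav_trail(alpha[1], inner, fresh)
    if tag == "alt":
        return ("or", nav_trail(alpha[1], psi, fresh),
                nav_trail(alpha[2], psi, fresh))
    if tag == "star":
        x = ("var", next(fresh))
        return ("mu", x[1], ("or", psi, nav_trail(alpha[1], x, fresh)))
    raise ValueError(f"unknown trail constructor: {tag!r}")


def nav_counting(alpha, op, psi, neg_psi, c, fresh):
    """Navigation of a counting formula annotated with proposition `c`.

    `neg_psi` is the negation normal form of `psi`, supplied by the caller.
    """
    counted = ("and", psi, c)
    if op == ">":
        return nav_trail(alpha, counted, fresh)
    if op == "<=":
        not_counted = ("and", neg_psi, ("not", c))
        return nav_trail(alpha, ("or", counted, not_counted), fresh)
    raise ValueError(f"unknown comparison: {op!r}")
```

For the trail "first child, then any number of next siblings", the result has the shape $\langle\triangledown\rangle\,\mu x.(\psi \wedge c) \vee \langle\triangleright\rangle x$.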

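not contain counting formulas) from $FL(\phi)$, every proposition occurring in $\phi$,
written $P_\phi$, every counting proposition, written $C$, and an extra proposition
that does not occur in $\phi$ used to represent other names, written $p_{\overline{\phi}}$.

$$lean(\phi) = \{\langle m\rangle\top\} \cup \{\langle m\rangle\psi \in FL(\phi)\} \cup P_\phi \cup C \cup \{p_{\overline{\phi}}\}$$

A $\phi$-node, written $n^\phi$, is a subset of $lean(\phi)$, such that:

• exactly one proposition from $P_\phi \cup \{p_{\overline{\phi}}\}$ is present;

• when $\langle m\rangle\psi$ is present, then $\langle m\rangle\top$ is present; and

• both $\langle\triangle\rangle\top$ and $\langle\triangleleft\rangle\top$ cannot be present at the same time.

The set of $\phi$-nodes is defined as $N^\phi$.

Intuitively, a node $n^\phi$ corresponds to a formula.

$$n^\phi = \bigwedge_{\psi \in n^\phi} \psi \ \wedge \bigwedge_{\psi \in lean(\phi) \setminus n^\phi} \neg\psi$$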
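
The three conditions that make a subset of the lean a $\phi$-node can be checked mechanically. The sketch below is only an illustration: the tuple encoding, the modality names "down", "right", "up", "left", and the function name `is_phi_node` are assumptions, and `props` is meant to hold the atomic propositions together with the extra "other names" proposition.

```python
TOP = ("top",)


def is_phi_node(node, lean, props):
    """Check the three conditions on a candidate phi-node.

    node  : iterable of formulas (hashable tuples)
    lean  : iterable of all lean formulas
    props : atomic propositions, including the extra "other names" one
    """
    node = frozenset(node)
    if not node <= frozenset(lean):
        return False
    # exactly one atomic proposition is present
    if len(node & frozenset(props)) != 1:
        return False
    # <m>psi present implies <m>T present
    for f in node:
        if f[0] == "mod" and ("mod", f[1], TOP) not in node:
            return False
    # a node cannot be both a first child and a next sibling
    if ("mod", "up", TOP) in node and ("mod", "left", TOP) in node:
        return False
    return True
```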

When the formula under consideration is fixed, we often omit the super- 
script. 

A $\phi$-tree is either the empty tree $\emptyset$, or a triple $(n^\phi, \Gamma_1, \Gamma_2)$ where $\Gamma_1$ and $\Gamma_2$
are $\phi$-trees. When clear from the context, we usually refer to $\phi$-trees simply as
trees.
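$$\frac{}{n \vdash_\phi \top} \qquad \frac{\psi \in n}{n \vdash_\phi \psi} \qquad \frac{\psi \notin n}{n \vdash_\phi \neg\psi} \qquad \frac{n \vdash_\phi \psi_1 \qquad n \vdash_\phi \psi_2}{n \vdash_\phi \psi_1 \wedge \psi_2}$$

$$\frac{n \vdash_\phi \psi_1}{n \vdash_\phi \psi_1 \vee \psi_2} \qquad \frac{n \vdash_\phi \psi_2}{n \vdash_\phi \psi_1 \vee \psi_2} \qquad \frac{n \vdash_\phi \psi[\mu x.\psi / x]}{n \vdash_\phi \mu x.\psi}$$

Figure 10: Local entailment relation between nodes and formulas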
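
The local entailment relation can be read as a small recursive check on a single node. The following Python sketch is one possible reading, not the authors' implementation: it assumes the usual rules (membership, negation of an absent lean formula, conjunction, disjunction, and one-step fixpoint unfolding), a hypothetical tuple encoding, and fixpoint variables that occur only under modalities, so that unfolding terminates.

```python
def substitute(f, var, repl):
    """Replace free occurrences of variable `var` in `f` by `repl`."""
    tag = f[0]
    if tag == "var":
        return repl if f[1] == var else f
    if tag in ("and", "or"):
        return (tag, substitute(f[1], var, repl), substitute(f[2], var, repl))
    if tag == "mod":
        return ("mod", f[1], substitute(f[2], var, repl))
    if tag == "mu":
        return f if f[1] == var else ("mu", f[1], substitute(f[2], var, repl))
    return f  # "prop", "top", "not": no variables inside


def entails(node, f):
    """Local entailment of formula `f` by a phi-node (a set of lean formulas)."""
    if f in node:
        return True
    tag = f[0]
    if tag == "top":
        return True
    if tag == "not":
        return f[1] not in node
    if tag == "and":
        return entails(node, f[1]) and entails(node, f[2])
    if tag == "or":
        return entails(node, f[1]) or entails(node, f[2])
    if tag == "mu":
        # unfold once; modal subformulas are then looked up in the node
        return entails(node, substitute(f[2], f[1], f))
    return False  # a proposition or modal formula that is absent from the node
```

Modal formulas are never explored further: they are either members of the node or not, which is what makes the relation local.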

We now turn to the definition of consistency of a $\phi$-tree. To this end, we
define an entailment relation between a node and a formula in Figure 10.

Two nodes $n_1$ and $n_2$ are consistent under modality $m \in \{\triangledown, \triangleright\}$, written
$R^\phi(n_1, m) = n_2$, iff

$$\forall \langle m\rangle\psi \in lean(\phi),\ \langle m\rangle\psi \in n_1 \iff n_2 \vdash_\phi \psi$$

$$\forall \langle\bar{m}\rangle\psi \in lean(\phi),\ \langle\bar{m}\rangle\psi \in n_2 \iff n_1 \vdash_\phi \psi$$

Consistency is checked each time a node is added to the tree, ensuring that
forward modalities of the node are indeed satisfied by the nodes below, and that
pending backward modalities of the node below are consistent with the added
node. Note that counting formulas are not considered at this point, as they are
globally verified in the next step.

Upon generation of a finished tree, i.e., a tree with no pending backward
modality, one may check whether a node of this tree satisfies $\phi$. To this end, we
first define forward navigation in a $\phi$-tree $\Gamma$. Given a path consisting of forward
modalities $\rho$, $\Gamma(\rho)$ is the node at that path. It is undefined if there is no such
node.

$$(n, \Gamma_1, \Gamma_2)(\epsilon) = n$$

$$(n, \Gamma_1, \Gamma_2)(\triangledown\rho) = \Gamma_1(\rho)$$

$$(n, \Gamma_1, \Gamma_2)(\triangleright\rho) = \Gamma_2(\rho)$$

We also allow extending the path with backward modalities if they match the
last modality of the path.

$$(n, \Gamma_1, \Gamma_2)(\rho\triangledown\triangle) = (n, \Gamma_1, \Gamma_2)(\rho)$$

$$(n, \Gamma_1, \Gamma_2)(\rho\triangleright\triangleleft) = (n, \Gamma_1, \Gamma_2)(\rho)$$

Now, we are able to define an entailment relation along paths in $\phi$-trees
in Figure 11. This relation extends the local entailment relation (Figure 10) with
checks for counting formulas. Note that the case for fixpoints is contained in the
case for formulas with no counting sub-formula. In the "less than" case, we need
to make sure that every node reachable through the trail is taken into account,
either as counted if it satisfies $\psi$, or not counted otherwise (in this case, $\neg\psi$
denotes the negation normal form).
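
The navigation of a tree along a path, with backward steps allowed only when they undo the step just before them, might be sketched as follows. The encoding is assumed for illustration: a tree is `None` (empty) or a triple `(node, first_child_tree, next_sibling_tree)`, and modalities are the strings "down", "right", "up", "left". Cancelling backward steps one at a time against the preceding forward step is this sketch's reading of the two backward equations.

```python
# Each backward modality undoes exactly one forward modality.
CONVERSE = {"up": "down", "left": "right"}


def normalize(path):
    """Cancel each backward step against the forward step just before it.

    Returns a forward-only path, or None when a backward step does not
    match the last modality (navigation is then undefined).
    """
    out = []
    for m in path:
        if m in CONVERSE:
            if not out or out[-1] != CONVERSE[m]:
                return None
            out.pop()
        else:
            out.append(m)
    return out


def node_at(tree, path):
    """Node reached from the root of `tree` along `path`, or None if undefined."""
    forward = normalize(path)
    if forward is None:
        return None
    for m in forward:
        if tree is None:
            return None
        tree = tree[1] if m == "down" else tree[2]
    return None if tree is None else tree[0]
```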
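$$\frac{\phi' \text{ does not contain counting formulas} \qquad \Gamma(\rho) \vdash_\phi \phi'}{\rho \vdash_\Gamma \phi'} \qquad \frac{\rho \vdash_\Gamma \phi_1 \qquad \rho \vdash_\Gamma \phi_2}{\rho \vdash_\Gamma \phi_1 \wedge \phi_2}$$

$$\frac{\rho \vdash_\Gamma \phi_1}{\rho \vdash_\Gamma \phi_1 \vee \phi_2} \qquad \frac{\rho \vdash_\Gamma \phi_2}{\rho \vdash_\Gamma \phi_1 \vee \phi_2} \qquad \frac{\rho m \vdash_\Gamma \phi'}{\rho \vdash_\Gamma \langle m\rangle\phi'}$$

$$\frac{|\{n' \mid \rho' \in \alpha \wedge \Gamma(\rho\rho') = n' \wedge n' \vdash_\phi \psi \wedge c\}| > k}{\rho \vdash_\Gamma \langle\alpha\rangle^{c}_{>k}\,\psi}$$

$$\frac{|\{n' \mid \rho' \in \alpha \wedge \Gamma(\rho\rho') = n' \wedge n' \vdash_\phi \psi \wedge c\}| \leq k \qquad \forall \rho' \in \alpha,\ \Gamma(\rho\rho') \vdash_\phi (\psi \wedge c) \vee (\neg\psi \wedge \neg c)}{\rho \vdash_\Gamma \langle\alpha\rangle^{c}_{\leq k}\,\psi}$$

Figure 11: Global entailment relation (incl. counting formulas)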

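We conclude these preliminaries by introducing some final notations. The
root of a $\phi$-tree is defined as follows.

$$root(\emptyset) = \emptyset$$

$$root((n, \Gamma_1, \Gamma_2)) = n$$

A $\phi$-tree $\Gamma$ satisfies a formula $\phi$, written $\Gamma \vdash \phi$, if neither $\langle\triangle\rangle\top$ nor $\langle\triangleleft\rangle\top$ occur
in $root(\Gamma)$, and if there is a path $\rho$ such that $\rho \vdash_\Gamma \phi$. A set of trees $ST$ satisfies
a formula $\phi$, written $ST \vdash \phi$, when there is a tree $\Gamma \in ST$ such that $\Gamma \vdash \phi$.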

4.3 The Algorithm 

We are now ready to present the algorithm, which is parameterized by $K(\phi)$
(defined in Figure 12), the maximum number of occurrences of a given node
in a path from the root of the tree to a leaf. The algorithm builds consistent
candidate trees from the bottom up, and checks at each step if one of the built
trees satisfies the formula, returning 1 if it is the case. As the set of nodes from
which to build the trees is finite, it eventually stops and returns 0 if no satisfying
tree has been found.

To bound the size of the trees that are built, we restrict the number of
identical nodes on a path from the root to any leaf by $K(\phi) + 2$, defined in
Figure 12, using $nmax$ defined as follows.

$$nmax(n, \Gamma_1, \Gamma_2) = \max(nmax(n, \Gamma_1),\ nmax(n, \Gamma_2))$$

$$nmax(n, (n, \Gamma_1, \Gamma_2)) = 1 + nmax(n, \Gamma_1, \Gamma_2)$$

$$nmax(n, (n', \Gamma_1, \Gamma_2)) = nmax(n, \Gamma_1, \Gamma_2) \quad \text{if } n \neq n'$$

$$nmax(n, \emptyset) = 0$$
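
The four nmax equations translate directly into two mutually recursive functions. In this sketch the tree encoding (`None` for the empty tree, a triple `(node, first_child_tree, next_sibling_tree)` otherwise) and the function names are assumptions made for illustration.

```python
def nmax_children(n, g1, g2):
    """Largest number of occurrences of `n` on any downward path in g1 or g2."""
    return max(nmax_tree(n, g1), nmax_tree(n, g2))


def nmax_tree(n, tree):
    """Largest number of occurrences of `n` on a path from the root of `tree`."""
    if tree is None:
        return 0
    node, g1, g2 = tree
    below = nmax_children(n, g1, g2)
    return 1 + below if node == n else below
```

Each node is visited once, so the cost is linear in the size of the tree being measured.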
Algorithm 1 Check Satisfiability of $\phi$
repeat
    $AUX \leftarrow \{(n, \Gamma_1, \Gamma_2) \mid$                        {we extend the trees}
        $nmax(n, \Gamma_1, \Gamma_2) < K(\phi) + 2$                        {with an available node}
        for $i$ in $\triangledown, \triangleright$                         {and each child is either}
            $\Gamma_i = \emptyset$ and $\langle i\rangle\top \notin n$     {an empty tree}
            or $\Gamma_i \in ST$                                            {or a previously built tree}
                $\langle\bar{i}\rangle\top \in root(\Gamma_i)$             {with pending backward modalities}
                $R^\phi(n, i) = root(\Gamma_i)\}$                          {checking consistency}
    if $AUX \subseteq ST$ then
        return 0                                                            {No new tree was built}
    end if
    $ST \leftarrow ST \cup AUX$
until $ST \vdash \phi$
return 1
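
The overall shape of the saturation loop can be sketched in Python as below. This is a skeleton under assumptions, not the authors' implementation: the set of trees is assumed to start empty, trees are `None` or hashable triples `(node, first_child_tree, next_sibling_tree)`, and the four callbacks (`wants_child`, `pending_parent`, `consistent`, `satisfies`) are hypothetical stand-ins for membership of the forward modality in the node, membership of the converse modality in the child's root, the consistency relation, and the global satisfaction test.

```python
from itertools import product

FORWARD = ("down", "right")


def nmax(n, tree):
    """Largest number of occurrences of `n` on a downward path of `tree`."""
    if tree is None:
        return 0
    node, g1, g2 = tree
    below = max(nmax(n, g1), nmax(n, g2))
    return 1 + below if node == n else below


def check_sat(nodes, bound, wants_child, pending_parent, consistent, satisfies):
    """Bottom-up construction of candidate trees.

    nodes                : iterable of hashable phi-nodes
    bound                : K(phi) + 2
    wants_child(n, i)    : True iff <i>T is in n
    pending_parent(r, i) : True iff the converse of <i>T is in root r
    consistent(n, i, r)  : True iff n and r are consistent under i
    satisfies(st)        : True iff some tree of st satisfies the formula
    """
    nodes = list(nodes)
    st = set()
    while True:
        aux = set()
        for n in nodes:
            options = []
            for i in FORWARD:
                opts = []
                if not wants_child(n, i):
                    opts.append(None)  # empty subtree
                opts.extend(t for t in st
                            if pending_parent(t[0], i) and consistent(n, i, t[0]))
                options.append(opts)
            for g1, g2 in product(*options):
                if max(nmax(n, g1), nmax(n, g2)) < bound:
                    aux.add((n, g1, g2))
        if aux <= st:
            return False  # no new tree was built
        st |= aux
        if satisfies(st):
            return True
```

Passing the logic-specific parts as callbacks keeps the sketch independent of any particular encoding of formulas.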



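$$K(p) = K(\neg p) = K(\neg\langle m\rangle\top) = K(\top) = K(\mu x.\psi) = 0$$

$$K(\phi_1 \wedge \phi_2) = K(\phi_1 \vee \phi_2) = K(\phi_1) + K(\phi_2)$$

$$K(\langle m\rangle\phi) = K(\phi)$$

$$K(\langle\alpha\rangle_{\# k}\,\psi) = k + 1$$

Figure 12: Occurrences bound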
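
The occurrences bound is a simple structural recursion. The Python sketch below uses an assumed tuple encoding in which a counting formula is written `("count", trail, op, k, psi)`; the encoding and the function name are illustrative only.

```python
def occurrence_bound(f):
    """K(f): bound contributed by the counting subformulas of `f`."""
    tag = f[0]
    if tag in ("and", "or"):
        return occurrence_bound(f[1]) + occurrence_bound(f[2])
    if tag == "mod":
        return occurrence_bound(f[2])
    if tag == "count":
        return f[3] + 1  # f[3] is the constant k
    # propositions, negated forms, top, and fixpoints contribute nothing
    return 0
```

For a conjunction of a proposition with a formula counting "more than 2" nodes under one modality, the function returns 3.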
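Figure 13: Checking $\phi = p_1 \wedge \langle\triangledown\rangle\langle\triangleright^*\rangle_{>2}\, p_2$

Consider for instance the formula $\phi = p_1 \wedge \langle\triangledown\rangle\langle\triangleright^*\rangle_{>2}\, p_2$. The computed lean
is as follows, where $\psi = \mu x.(p_2 \wedge c) \vee \langle\triangleright\rangle x$.

$$\{p_1, p_2, p_3, c, \langle\triangledown\rangle\top, \langle\triangleright\rangle\top, \langle\triangle\rangle\top, \langle\triangleleft\rangle\top, \langle\triangledown\rangle\psi, \langle\triangleright\rangle\psi\}$$

Names other than $p_1$ and $p_2$ are represented by $p_3$; $c$ identifies counted nodes.
Computing the bound on nodes, we get $K(\phi) = 3$.

After the first step, $ST$ consists of the trees $(\{p_i\}, \emptyset, \emptyset)$, $(\{p_i, c\}, \emptyset, \emptyset)$,
$(\{p_i, \langle\bar{j}\rangle\top\}, \emptyset, \emptyset)$, and $(\{p_i, c, \langle\bar{j}\rangle\top\}, \emptyset, \emptyset)$ with $i \in \{1, 2, 3\}$ and $j \in \{\triangledown, \triangleright\}$. At
this point the three finished trees in $ST$ are tested and found not to satisfy $\phi$.

After the second iteration many trees are created, but the one of interest is
the following.

$$\Gamma_0 = (\{p_2, c, \langle\triangleright\rangle\top, \langle\triangleleft\rangle\top, \langle\triangleright\rangle\psi\},\ \emptyset,\ (\{p_2, c, \langle\triangleleft\rangle\top\}, \emptyset, \emptyset))$$

The third iteration yields the following tree.

$$\Gamma_1 = (\{p_2, c, \langle\triangleright\rangle\top, \langle\triangle\rangle\top, \langle\triangleright\rangle\psi\},\ \emptyset,\ \Gamma_0)$$

We can conclude by the fourth iteration when we find the tree $(\{p_1, \langle\triangledown\rangle\psi, \langle\triangledown\rangle\top\}, \Gamma_1, \emptyset)$,
which is found to satisfy $\phi$ at path $\epsilon$. As the nodes at every step are different,
the limit is not reached. Figure 13 depicts a graphical representation of the
example where counted nodes (containing $c$) are drawn as thick circles.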

4.4 Termination 

Proving termination of the algorithm is straightforward, as only a finite number 
of trees may be built and the algorithm stops as soon as it cannot build a new 
tree. 

4.5 Soundness 

If the algorithm terminates with a candidate, we show that the initial formula
is satisfiable. Let $\Gamma$ and $\rho$ be the $\phi$-tree and path such that $\rho \vdash_\Gamma \phi$. We build
a tree from $\Gamma$ and show that the interpretation of $\phi$ for this tree includes the
node at path $\rho$.

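We write $T(\Gamma)$ for the tree $(N, R, L)$ defined as follows. We first rewrite $\Gamma$
such that each node $n$ is replaced by the path to reach it (i.e., nodes are identified
by their path).

$$path(n, \Gamma_1, \Gamma_2) \rightarrow (\epsilon,\ path(\triangledown, \Gamma_1),\ path(\triangleright, \Gamma_2))$$

$$path(\rho, (n, \Gamma_1, \Gamma_2)) \rightarrow (\rho,\ path(\rho\triangledown, \Gamma_1),\ path(\rho\triangleright, \Gamma_2))$$

$$path(\rho, \emptyset) \rightarrow \emptyset$$

We then define:

• $N = nodes(path(\Gamma))$;

• for every $(\rho, \Gamma_1, \Gamma_2)$ in $path(\Gamma)$ and $i = \triangledown, \triangleright$, if $\Gamma_i \neq \emptyset$ then $R(\rho, i) = \rho i$
and $R(\rho i, \bar{i}) = \rho$; and

• for all $\rho \in N$ if $p \in \Gamma(\rho)$ then $L(\rho) = p$.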
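
The construction of the structure $(N, R, L)$ from a tree can be sketched as follows. The encoding is an assumption of this example: trees are `None` or triples `(node, first_child_tree, next_sibling_tree)`, nodes are frozensets, paths are tuples of the strings "down" and "right", and `props` is the set of atomic propositions (each node is assumed to contain exactly one of them).

```python
def relabel(tree, path=()):
    """Map every path of `tree` to the phi-node found there."""
    if tree is None:
        return {}
    node, g1, g2 = tree
    out = {path: node}
    out.update(relabel(g1, path + ("down",)))
    out.update(relabel(g2, path + ("right",)))
    return out


def build_structure(tree, props):
    """Return (N, R, L): nodes as paths, the navigation map, and the labeling."""
    by_path = relabel(tree)
    N = set(by_path)
    R = {}
    for p in N:
        for i, converse in (("down", "up"), ("right", "left")):
            child = p + (i,)
            if child in N:
                R[(p, i)] = child          # forward edge
                R[(child, converse)] = p   # matching backward edge
    # the unique atomic proposition of each phi-node becomes its label
    L = {p: next(iter(by_path[p] & props)) for p in N}
    return N, R, L
```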

Lemma 4.1. Let $\psi$ be a subformula of $\phi$ with no counting formula. If $\Gamma(\rho) \vdash_\phi \psi$
then we have $\rho \in [\![\psi]\!]^{T(\Gamma)}_{\emptyset}$.

Proof. We proceed by induction on the lexical ordering of the number of
unfoldings of $\psi$ that are required for $T(\Gamma)$ as defined by Lemma 2.1 and of the size
of the formula.

The base cases are $\top$, atomic or counting propositions, and negated forms.
These are immediate by definition of $[\![\psi]\!]^{T(\Gamma)}_{\emptyset}$. The cases for disjunction and
conjunction are immediate by induction (the formula is smaller). The case for
fixpoints is also immediate by induction, as the number of unfoldings required
decreases, and as $[\![\mu x.\psi]\!]^{T(\Gamma)}_{\emptyset} = [\![\psi[\mu x.\psi / x]]\!]^{T(\Gamma)}_{\emptyset}$.

The last case is the presence of a modality $\langle m\rangle\psi$ from the $\phi$-node $\Gamma(\rho)$. In this
case we rely on the fact that the nodes $\Gamma(\rho m)$ and $\Gamma(\rho)$ are consistent to derive
$\Gamma(\rho m) \vdash_\phi \psi$. We then conclude by induction as the formula is smaller. □



Theorem 4.2 (Soundness). If $\rho \vdash_\Gamma \phi$ then $\rho \in [\![\phi]\!]^{T(\Gamma)}_{\emptyset}$.

Proof. The proof proceeds by induction on the derivation of $\rho \vdash_\Gamma \phi$. Most cases
are immediate (or rely on Lemma 4.1). For the "greater than" counting case,
we rely on the $k + 1$ selected nodes that have to satisfy $\psi \wedge c$ thus $\psi$. In addition,
in the "less than" case, every node that is not counted has to satisfy $\neg\psi \wedge \neg c$,
so in particular $\neg\psi$. In both cases we conclude by induction. □

4.6 Completeness 

Our proof proceeds in two steps. We build a $\phi$-tree that satisfies the formula, then
we show it is actually built by the algorithm. As the proof is quite complex, we
devote some space to detail it.

Assume that formula $\phi$ is satisfiable by a tree $T$. We consider the smallest
such tree (i.e., the tree with the fewest nodes) and fix $n^*$, a node
witnessing satisfiability.

We now build a $\phi$-tree homomorphic to $T$, called the lean labeled version of $\phi$,
written $\Gamma(T, \phi)$. To this end, we start by annotating counted nodes along with
their corresponding counting proposition, yielding a new tree $T^c$. Starting from
$n^*$ and by induction on $\phi$, we proceed as follows. For formulas with no counting
subformula, including recursion, we stop. For conjunction and disjunction of
formulas, we recursively annotate according to both subformulas. For
modalities, we recursively annotate from the node under the modality. For $\langle\alpha\rangle^{c}_{\leq k}\,\psi$,
we annotate every selected node with the counting proposition corresponding
to the formula. For $\langle\alpha\rangle^{c}_{>k}\,\psi$, we annotate exactly $k + 1$ selected nodes.

We now extend the semantics of formulas to take into account counting
propositions and annotated nodes, written $[\![\cdot]\!]^{T^c}_{V}$. The definition is identical
to Figure 2 with one addition and two changes. The addition is for counting
propositions, which we define as $n \in [\![c]\!]^{T^c}_{V}$ iff $n$ is annotated by $c$. The two
changes are for counting formulas, which we define as follows, where we
select only nodes that are annotated.

$$[\![\langle\alpha\rangle^{c}_{\leq k}\,\phi]\!]^{T^c}_{V} = \{n \mid |\{n' \in [\![\phi]\!]^{T^c}_{V} \cap [\![c]\!]^{T^c}_{V} \mid n \xrightarrow{\alpha} n'\}| \leq k\}$$

$$[\![\langle\alpha\rangle^{c}_{>k}\,\phi]\!]^{T^c}_{V} = \{n \mid |\{n' \in [\![\phi]\!]^{T^c}_{V} \cap [\![c]\!]^{T^c}_{V} \mid n \xrightarrow{\alpha} n'\}| > k\}$$

We show that this modification of the semantics does not change the
satisfiability of the formula.

Lemma 4.3. We have $n^* \in [\![\phi]\!]^{T^c}_{\emptyset}$.

Proof. We proceed by recursion on the derivation $n^* \in [\![\phi]\!]^{T}_{\emptyset}$. The cases where
no counting formula is involved, thus including fixpoints, are immediate, as
the selected nodes are identical. The disjunction, conjunction, and modality
cases are also immediate by induction. The interesting cases are the counting
formulas.

For $\langle\alpha\rangle_{>k}\,\psi$, as there are exactly $k + 1$ nodes annotated, the property is true by
induction. For $\langle\alpha\rangle_{\leq k}\,\psi$, we rely on the fact that every counted node is annotated.
We conclude by remarking that $\psi$ does not contain a counting formula, thus we
have $[\![\psi]\!]^{T^c}_{V} = [\![\psi]\!]^{T}_{V}$ and $[\![\neg\psi]\!]^{T^c}_{V} = [\![\neg\psi]\!]^{T}_{V}$. □

To every node $n$, we associate $n^\phi$, the largest subset of formulas of the lean
selecting the node.

$$n^\phi = \{\phi_0 \mid n \in [\![\phi_0]\!]^{T^c}_{\emptyset},\ \phi_0 \in lean(\phi)\}$$

This is a $\phi$-node as it contains one and exactly one proposition, and if it
includes a modal formula $\langle m\rangle\psi$, then it also includes $\langle m\rangle\top$. The tree $\Gamma(T, \phi)$ is
then built homomorphically to $T$.

In the remainder of this section, we write $\Gamma$ for $\Gamma(T, \phi)$. We now check that
$\Gamma$ is consistent, starting with local consistency.
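$$\frac{\psi \in lean(\phi)}{\psi \mathrel{\dot{\in}} lean(\phi)} \qquad \frac{\psi_1 \mathrel{\dot{\in}} lean(\phi) \qquad \psi_2 \mathrel{\dot{\in}} lean(\phi)}{\psi_1 \wedge \psi_2 \mathrel{\dot{\in}} lean(\phi)}$$

$$\frac{\psi_1 \mathrel{\dot{\in}} lean(\phi) \qquad \psi_2 \mathrel{\dot{\in}} lean(\phi)}{\psi_1 \vee \psi_2 \mathrel{\dot{\in}} lean(\phi)} \qquad \frac{}{\top \mathrel{\dot{\in}} lean(\phi)} \qquad \frac{\psi \in (P_\phi \cup \langle m\rangle\top \cup C)}{\neg\psi \mathrel{\dot{\in}} lean(\phi)}$$

Figure 14: Formula induced by a lean

In the following, we say a formula $\psi$ is induced by the lean of $\phi$, written
$\psi \mathrel{\dot{\in}} lean(\phi)$, if it consists of the boolean combination of subformulas from the
lean as defined in Figure 14.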

Lemma 4.4. Let $\langle m\rangle\psi$ be a formula in $lean(\phi)$, and let $\psi'$ be $\psi$ after unfolding
its fixpoint formulas not under modalities. We have $\psi' \mathrel{\dot{\in}} lean(\phi)$.

Proof. By definition of the lean and of the $\dot{\in}$ relation. □

Lemma 4.5. Let $\psi$ be a formula induced by $lean(\phi)$. We have $n \in [\![\psi]\!]^{T^c}_{\emptyset}$ if
and only if $n^\phi \vdash_\phi \psi$.

Proof. We proceed by induction on $\psi$. The base cases (the formula is in the
$\phi$-node or is a negation of a lean formula not in the $\phi$-node) hold by definition
of $n^\phi$. The inductive cases are straightforward as these formulas only contain
fixpoints under modalities. □

Lemma 4.6. Let $n_1$ and $n_2$ such that $R(n_1, m) = n_2$ with $m \in \{\triangledown, \triangleright\}$. We
have $R^\phi(n_1^\phi, m) = n_2^\phi$.

Proof. Let $\langle m\rangle\psi$ be a formula in $lean(\phi)$. We show that $\langle m\rangle\psi \in n_1^\phi \iff
n_2^\phi \vdash_\phi \psi$. We have $\langle m\rangle\psi \in n_1^\phi$ if and only if $n_1 \in [\![\langle m\rangle\psi]\!]^{T^c}_{\emptyset}$ by definition of $n_1^\phi$,
which in turn holds if and only if $n_2 = R(n_1, m) \in [\![\psi]\!]^{T^c}_{\emptyset}$. We now consider $\psi'$
which is $\psi$ after unfolding its fixpoint formulas not under modalities. We have
$[\![\psi]\!]^{T^c}_{\emptyset} = [\![\psi']\!]^{T^c}_{\emptyset}$ and we conclude by Lemmas 4.4 and 4.5. □

We now turn to global consistency, taking counting formulas into account. 

Lemma 4.7. Let $\phi_s$ be a subformula of $\phi$, and $\rho$ be a path from the root in $\Gamma$
such that $\Gamma(\rho) \in [\![\phi_s]\!]^{T^c}_{\emptyset}$. We then have $\rho \vdash_\Gamma \phi_s$.

Proof. We proceed by induction on $\phi_s$.

If $\phi_s$ does not contain any counting formula, we consider $\phi'_s$ which is $\phi_s$ after
unfolding its fixpoint formulas not under modalities. We have $[\![\phi_s]\!]^{T^c}_{\emptyset} = [\![\phi'_s]\!]^{T^c}_{\emptyset}$
and $\phi'_s \mathrel{\dot{\in}} lean(\phi)$. We conclude by Lemma 4.5.

For most inductive cases, the proof is immediate by induction, as the formula
size decreases.

For $\langle\alpha\rangle_{>k}\,\psi$, we have by induction for every counted node $\rho\rho' \vdash_\Gamma \psi$ and
$\rho\rho' \vdash_\Gamma c$. We conclude by the conjunction rule and by the counting rule of
Figure 11.

For $\langle\alpha\rangle_{\leq k}\,\psi$, we proceed as above for the counted nodes. For the nodes that
are not counted, we have $\Gamma(\rho\rho') \vdash_\phi \neg\psi$ by Lemma 4.5 (since $\neg\psi \mathrel{\dot{\in}} lean(\phi)$). We
conclude by remarking that the node is not annotated by $c$, hence
$\Gamma(\rho\rho') \vdash_\phi \neg c$. □

We next show that the $\phi$-tree $\Gamma$ is actually built by the algorithm. The proof
follows closely the one from [10], with a crucial exception: we need to make
sure there are enough instances of each formula. Indeed, in [10], the algorithm
uses a $\phi$-type (a subset of $lean(\phi)$) at most once on each branch from the root
to a leaf of the built tree. This yields a simple condition to stop the algorithm
and conclude the formula is unsatisfiable. However, in the presence of counting
formulas, a given $\phi$-type may occur more than once on a branch. To maintain
the termination of the algorithm, we bound the number of identical $\phi$-types that
may be needed by $K(\phi)$ as defined in Figure 12. We thus need to check that
this bound is sufficient to build a tree for any satisfiable formula.

We recall that $\phi$ is a satisfiable formula and $T$ is a smallest tree such that $\phi$
is satisfied, and $n^*$ is a witness of satisfiability.

We proceed in two steps: first we show that counted nodes (with counting
propositions) imply a bound on the number of identical $\phi$-types on a branch for
a smallest tree. Second, we show that this minimal marking is bounded by $K(\phi)$.

In the following, we call counted nodes and node $n^*$ annotations. We define
the projection of an annotation on a path. Let $\rho$ be a path from the root of the
tree to a leaf. An annotation projects on $\rho$ at $\rho_1$ if $\rho = \rho_1\rho_2$, the annotation is
at $\rho_1\rho_m$, and $\rho_2$ shares no prefix with $\rho_m$.

Lemma 4.8. Let $T^c$ be the annotated tree, $\rho$ a path from the root of the tree
to a leaf, $n_1$ and $n_2$ two distinct nodes of $\rho$ such that $n_1^\phi = n_2^\phi$. Then either
annotations project both on $\rho$ at $n_1$ and $n_2$, or an annotation projects strictly
between $n_1$ and $n_2$.

Proof. We proceed by contradiction: we assume there is no annotation that
projects between $n_1$ and $n_2$ and at most one of them has an annotation that
projects on it. Without loss of generality, we assume that $n_2$ is below $n_1$ in the
tree.

Assume neither $n_1$ nor $n_2$ is annotated (through projection). We consider
the tree $T_s$ where $n_2$ is "grafted" upon $n_1$. Formally, let $\rho_1$ be the path to $n_1$
and $\rho_1\rho_2$ the path to $n_2$. We remove every node whose path is of the form $\rho_1\rho_3$
where $\rho_2$ is not a prefix of $\rho_3$, and we also remove node $n_2$. The mapping $R'$
from nodes and modalities to nodes is the same as before for the nodes that are
kept except for $n_1$, where $R'(n_1, \triangledown) = R(n_2, \triangledown)$ and $R'(n_1, \triangleright) = R(n_2, \triangleright)$. For
every path $\rho$ of $T$, let $\rho_s$ be the potentially shorter path if it exists (i.e., if it
was not removed when pruning the tree). More precisely, if $\rho' = \rho'_1\rho'_3$ where $\rho'_1$
is a prefix of $\rho_1$ and the paths are disjoint from there, then $\Gamma_s(\rho') = \Gamma(\rho')$. If
$\rho' = \rho_1\rho_2\rho_3$, then $\Gamma_s(\rho_1\rho_3) = \Gamma(\rho')$.

We now show that $T_s$ still satisfies $\phi$ at $n^*$, a contradiction since this tree is
strictly smaller than $T$.

First, as there was no annotation projected, $n^*$ is still part of this tree at a
path $\rho_s$. We show that we have $\rho_s \vdash_{\Gamma_s} \phi$ by induction on the derivation $\rho \vdash_\Gamma \phi$.

Let $\rho' \vdash_\Gamma \phi'$ in the derivation, assuming that $\rho'_s$ is defined.

The case where $\phi'$ does not mention any counting formula is trivial: $\Gamma(\rho') =
\Gamma_s(\rho'_s)$ thus local entailment is immediate.

Conjunction and disjunction are also immediate by induction.

We now turn to the modality case, $\langle m\rangle\phi'$ where $\phi'$ contains a counting
formula. If $\rho'$ is neither $\rho_1$ nor $\rho_1\rho_2$, we deduce from the fact that $\rho'_s$ is defined
that $(\rho' m)_s$ is also defined and we conclude by induction. We now assume
that $\rho'$ is either $\rho_1$ or $\rho_1\rho_2$ and find a contradiction. First, remark that $\rho' \vdash_\Gamma
\langle m\rangle\phi'$ implies that the navigation generated by $\langle m\rangle\phi'$ is in $\Gamma(\rho_1) = \Gamma(\rho_1\rho_2)$. As
each syntactic occurrence of a counting formula mentions a distinct counting
proposition $c$, this is possible only if the counting formula is under a fixpoint or
under another counting formula, both of which are impossible.

We finally turn to the counting case $\langle\alpha\rangle_{\# k}\,\psi$. We say that a path does not
cross over when this path contains neither $n_1$ nor $n_2$. For nodes that are
reached using paths that do not cross over, we conclude by induction that they
are also counted. We show that the remaining nodes reached through a crossover
remain reachable (there cannot be any counted node in the part of the tree that
is removed since counted nodes are annotated and there was no annotation in
the part removed). Without loss of generality, assume that $\rho'$ is a prefix of $\rho_1$
(the counting formula is in the "top" part of the tree), and let $\rho_n$ be the path
from the counting formula to the counted node ($\rho_n$ is an instance of the trail
$\alpha$). This path is of the shape $\rho'_1\rho_2\rho_c$, with $\rho_1 = \rho'\rho'_1$. We now show that the
path $\rho'_1\rho_c$ is an instance of $\alpha$ if and only if $\rho_n$ is an instance of the trail, thus
the same node is still reached.

Recall that $\alpha$ is of the shape $\alpha_1, \ldots, \alpha_n, \alpha_{n+1}$ where $\alpha_1$ to $\alpha_n$ are of the
form $\alpha^*$, and where $\alpha_{n+1}$ does not contain a repeated trail. We say that a
prefix $\rho_p$ of a path $\rho$ stops at $i$ if there is a suffix $\rho_s$ such that $\rho_p\rho_s$ is still a
prefix of $\rho$, $\rho_p\rho_s \in \alpha_1, \ldots, \alpha_i$, and there is no shorter suffix $\rho'_s$ and $j$ such that
$\rho_p\rho'_s \in \alpha_1, \ldots, \alpha_j$. (Intuitively, $\alpha_i$ is the trail being used when matching the end
of $\rho_p$.) If there are several satisfying indices $i$, we consider the smallest.

We first show that a counting proposition is necessarily mentioned in a
formula of $n_1^\phi$, by contradiction. Assume no counting proposition is mentioned,
yet the counting crossed over. This can only occur for a "less than" counting
formula that reaches $n_2$ which is not counted (because the formula was false),
and if there is no path whose $\rho_n$ is a strict prefix that is an instance of $\alpha$
(otherwise, by definition of the lean and of $nav$ (Figure 9), a formula of the form
$nav(\langle\alpha'\rangle, (\psi \wedge c) \vee (\neg\psi \wedge \neg c))$ would be true and thus would be present,
contradicting the assumption that no counting proposition is mentioned). Since
$n_1^\phi = n_2^\phi$, the same is true for $n_1^\phi$, a direct contradiction to the fact that $n_2$ is
also reached by the trail. Thus counting propositions are mentioned in $n_1^\phi$ and
$n_2^\phi$.

We next show that there are $i \leq j \leq n$ such that both $\rho_1$ stops at $i$ and
$\rho_1\rho_2$ stops at $j$, i.e., neither $i$ nor $j$ may be $n + 1$. Recall that $\alpha_{n+1}$ does not
contain a repeated subtrail. Thus every formula of $n_1^\phi$ mentioning $c$ is of the
form $nav(\langle\alpha'\rangle, \psi)$, where $\alpha'$ does not contain a repetition. We consider the
largest such formula. Since $n_1$ is before $n_2$ in the path from the counting node
to the counted node, a similar formula with a larger trail or with a repetition
must occur in $n_1^\phi$, contradicting $n_1^\phi = n_2^\phi$.

Consider next the suffixes $\rho_s^1$ and $\rho_s^2$ computed when stating that the paths
stop at $i$ and $j$. These suffixes correspond to the path matching the end of $\alpha_i$
and $\alpha_j$, respectively (before the next iteration or switching to the next subtrail).
They have matching formulas in $n_1^\phi$ and $n_2^\phi$. As the formulas are present in
both nodes, then the remainder of the paths ($\rho_2\rho_c$ and $\rho_c$) are instances of
$(\rho_s^1 \mid \rho_s^2)\,\alpha_i \ldots \alpha_{n+1}$, thus $\rho'_1\rho_c$ is an instance of $\alpha$ if and only if $\rho_n$ is.

In the case of "greater than" counting, we conclude immediately by induction
as the same nodes are selected (thus there are enough). In the case of "less than",
we need to check that no new node is counted in the smaller tree. Assume
it is not the case for the formula $\langle\alpha\rangle_{\leq k}\,\psi$, thus there is a path $\rho_n \in \alpha$ to a
node satisfying $\psi$. As the same node can be reached in $\Gamma$, and as we have
$\Gamma(\rho'\rho_n) \vdash_\phi \neg\psi$ by induction, we have a contradiction.

This concludes the proof when neither $n_1$ nor $n_2$ is annotated. The proof
is identical when $n_2$ is annotated. If $n_1$ is annotated, we look at the first
modality between $n_1$ and $n_2$. If it is a $\triangledown$, then we build the smaller tree by
doing $R'(n_1, \triangledown) = R(n_2, \triangledown)$ (we remove the $\triangleright$ subtree from $n_2$ instead of $n_1$).
Symmetrically, if the first modality is a $\triangleright$, we consider $R'(n_1, \triangleright) = R(n_2, \triangleright)$ as
smaller tree. The rest of the proof proceeds as above. □

Theorem 4.9 (Completeness). If $\phi$ is satisfiable, then a satisfying tree is built.

Proof. The proof proceeds as in [10]; we only need to check there are enough
copies of each node to build every path. Let $\rho$ be a path from the root of the
tree to the leaves. By Lemma 4.8 there are at most $n + 1$ identical nodes in this
path, where $n$ is the number of annotations. The number of annotations is $c + 1$
where $c$ is the number of counted nodes. We show by an immediate induction
on the formula $\phi$ that $c$ is bounded by $K(\phi)$ as defined in Figure 12. We conclude
by remarking that $K(\phi) + 2$ is the number of identical nodes we allow in the
algorithm. □

4.7 Complexity 

We now show that the time complexity of the satisfiability algorithm is expo- 
nential in the formula size. This is achieved in two steps: we first show that 
the lean size is linear in the formula size, then we show that the algorithm has 
a single exponential complexity with relation to the lean size. 

Lemma 4.10. The lean size is linear in terms of the original formula size. 
Proof Sketch. First note that the size of the lean is the number of elements it
contains; the size of each element does not matter.

It was shown in [10] that the size of the lean generated by a non-counting
formula is linear with respect to the formula size.

We now describe the case for counting formulas. The lean consists of
propositions and of modal subformulas, including the ones generated by the navigation
of counting formulas (Figure 9). Moreover, each counting formula adds one fresh
counting proposition. In the case of "less than" formulas $\langle\alpha\rangle_{\leq k}\,\psi$, a duplication
occurs due to the consideration of the negation normal form of $\psi$. Since there
is no counting under counting, this duplication and the fact that the negation
normal form of a formula is linear in the size of the original formula (Figure 3)
result in the lean remaining linear. Another duplication occurs in the case of
counting formulas of the form $\langle\alpha_1 \mid \alpha_2\rangle_{\# k}\,\psi$. This duplication does not double
the size of the lean, however, since $\psi$ still occurs only once in the lean, thus the
number of elements in the lean induced by $nav(\langle\alpha_1\rangle, \psi) \vee nav(\langle\alpha_2\rangle, \psi)$ is the
same as the sum of the ones in $nav(\langle\alpha_1\rangle, \psi)$ and in $nav(\langle\alpha_2\rangle, \psi)$. □

Theorem 4.11. The satisfiability problem for the logic is decidable in time
$2^{O(n)}$, where $n$ is the size of the lean.

Proof Sketch. The maximum number of considered nodes is the number of
distinct tree nodes which is $2^n$, the number of subsets of the lean. For a given
formula $\phi$, the number of occurrences of the same node in the tree is bounded
by $K(\phi) \leq k * m$, where $k$ is the greatest constant occurring in the counting
formulas and $m$ is the number of counting subformulas of $\phi$. Hence the number
of steps of the algorithm is bounded by $2^n * k * m$.

At each iteration, the main operation performed by the algorithm is the
composition of trees stored in $AUX$. The cost of each iteration consists in:
the different searches needed to form the necessary triples $(n, \Gamma_1, \Gamma_2)$, the $nmax$
function and $R^\phi$. Since the total number of nodes is exponential, and the number
of different subtrees too, the maximum number of newly formed trees
(triples) at each step also has an exponential bound. The function $nmax$ performs
a single traversal of the tree which is also exponential. Since the entailment
relation involved in the definition of $R^\phi$ is local, $R^\phi$ is performed in linear time.
Computing the containment $AUX \subseteq ST$ and the union $ST \cup AUX$ are linear
operations over sets of exponential size.

The stop condition of the algorithm is checked by the global entailment
relation. It involves traversals parametrized by the number of trees, the number
of nodes in each tree, the number of traversals for the entailment relation of
counting formulas, and $K(\phi)$. Its time complexity is bounded by $(2^n * k * m)^3$.

Hence, the total time complexity of the algorithm is bounded by $(2^n * k * m)^{k'}$,
for some constant $k'$. □
5 Related Work

Counting over trees The notion of Presburger Automata for trees,
combining both regular constraints on the children of nodes and numerical constraints
given by Presburger formulas, has independently been introduced by Dal Zilio
and Lugiez [5] and Seidl et al. [25]. Specifically, Dal Zilio and Lugiez [5] propose
a modal logic for unordered trees called Sheaves logic. This logic allows one to
impose certain arithmetical constraints on children nodes but lacks recursion (i.e.,
fixpoint operators) and inverse navigation. Dal Zilio and Lugiez consider the
satisfiability and the membership problems. Demri and Lugiez [5] showed by
means of an automata-free decision procedure that this logic is only
PSPACE-complete. Restrictions like $p_1$ nodes have no more "children" than $p_2$ nodes, are
succinctly expressible by this approach. Seidl et al. [25] introduce a fixpoint
Presburger logic, which, in addition to numerical constraints on children nodes,
also supports recursive forward navigation. For example, expressions like the
descendants of $p_1$ nodes have no more "children" than the number of children of
descendants of $p_2$ nodes are efficiently represented. This means that constraints
can be imposed on sibling nodes (even if they are deep in the tree) by forward
recursive navigation but not on distant nodes which are not siblings.

Compared to the work presented here, neither of the two previous approaches 
can efficiently support constraints like there are more than 5 ancestors of "p" 
nodes. 

Furthermore, due to the lack of backward navigation, the works found in
[5J (S5J IS] are not suited for succinctly capturing XPath expressions. Indeed, it is
well-known that expressions with backward modalities are exponentially more
succinct than their forward-only counterparts [111 129) .

There is little hope to push the decidability envelope much further for
counting constraints. Indeed, it is known from |16[[6ll27] that the equivalence problem
is undecidable for XPath expressions with counting operators of the form:

• PathExpr1[count(PathExpr2) = count(PathExpr3)], or

• PathExpr1[position() = count(PathExpr2)].

This is the reason why logical frameworks that allow comparisons between counting
operators limit counting by restricting the PathExpr to immediate children
nodes [5] [25]. In this paper, we chose a different tradeoff: comparisons are
restricted to constants, but at the same time comparisons along more general
paths are permitted.

Counting over graphs The $\mu$-calculus is a propositional modal logic augmented
with least and greatest fixpoint operators [18]. Kupferman, Sattler and
Vardi study a $\mu$-calculus with graded modalities where one can express, e.g.,
that a graph node has at least n successors satisfying a certain property [19].
The modalities are limited in scope since they only count immediate successors
of a given node. A similar notion in trees consists in counting immediate
children nodes, as performed by the counting formula
$\langle\triangledown\rangle\langle\triangleright^{\star}\rangle_{\#k}\,\phi$, where $\phi$
describes the property to be counted. Compared to the graded modalities of [19],
we consider trees and we can extend the "immediate successor" notion to nodes
reachable through regular paths, involving reverse and recursive navigation.

A recent study [5] focuses on extending the $\mu$-calculus with inverse modalities
[29], nominals [24], and graded modalities of [19]. If only two of the above
constructs are considered, satisfiability of the enriched calculus is
EXPTIME-complete [3] [T]. However, if all of the above constructs are considered
simultaneously, the calculus becomes undecidable [2]. The present work shows that
this undecidability result in the case of graphs does not preclude decidable tree
logics combining such features.

XPath-like counting extensions The proposed logic can serve as the target for
the compilation of a few more sophisticated counting features, treated as
syntactic sugar (which may come at the extra cost of their translation).

In particular, XPath allows nested counting, as in the expression 

self::book[chapter[section > 1] > 1],

which selects the current "book" node provided it has at least two "chapter"
child nodes, each of which must in turn contain at least two "section" nodes.
For a simple class of formulas, namely those that count only on children nodes,
such nesting can be translated into ordinary logical formulas. For instance, the
logical formulation of the above XPath expression can be captured as follows:

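$\mathit{book} \wedge \langle\triangledown\rangle \mu x.(\mathit{chapter} \wedge \psi \wedge \langle\triangleright\rangle \mu y.\,\mathit{chapter} \wedge \psi \vee \langle\triangleright\rangle y) \vee \langle\triangleright\rangle x$

where $\psi = \langle\triangledown\rangle \mu x.(\mathit{section} \wedge \langle\triangleright\rangle \mu y.\,\mathit{section} \vee \langle\triangleright\rangle y) \vee \langle\triangleright\rangle x$.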
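
To make the intended meaning of the nested query concrete, the following Python sketch evaluates it directly on a small in-memory tree. The `Node` class and the helper names are hypothetical and are not part of the logic or of any XPath library. The sketch checks a single given tree, whereas the logic is meant for reasoning about all trees.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical unranked tree node: a label and an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def count_children(node: Node, label: str, pred=lambda n: True) -> int:
    """Number of children of `node` named `label` that also satisfy `pred`."""
    return sum(1 for c in node.children if c.label == label and pred(c))


def selects_book(node: Node) -> bool:
    """Direct reading of self::book[chapter[section > 1] > 1] on one node."""
    def has_two_sections(ch: Node) -> bool:
        return count_children(ch, "section") > 1

    return (node.label == "book"
            and count_children(node, "chapter", has_two_sections) > 1)


def chapter(n_sections: int) -> Node:
    """Build a chapter with the given number of section children."""
    return Node("chapter", [Node("section") for _ in range(n_sections)])


good = Node("book", [chapter(2), chapter(3)])   # two qualifying chapters
bad = Node("book", [chapter(2), chapter(1)])    # only one qualifying chapter
print(selects_book(good), selects_book(bad))
```

Only the chapters that themselves pass the inner test are counted by the outer test, which is the nesting that the logical translation has to unfold.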

In [21], Marx introduced an "until" operator for extending XPath's expressive
power to be complete with respect to first-order logic over trees. This
operator is trivially expressible in the present logic, owing to the use of the
fixpoint binder. We can even combine counting features with the "until" operator
and express properties that go beyond the expressive power of the XPath
standard. For instance, the following formula states that "starting from the
current node, and until we reach an ancestor named a, every ancestor has at
least 3 children named b":

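$\mu x.(\langle\triangledown\rangle\langle\triangleright^{\star}\rangle_{>2}\, b \wedge \mu y.\langle\triangle\rangle x \vee \langle\triangleleft\rangle y) \vee a$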
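
As an illustration of this "until" property, the sketch below checks it on a concrete tree with parent pointers. It follows the least-fixpoint reading: a node satisfies the property if it is named a, or if it has more than 2 children named b and its parent satisfies the property. The `Node` class and the function names are hypothetical, and the code evaluates one tree rather than deciding satisfiability.

```python
from typing import List, Optional


class Node:
    """Hypothetical tree node with a parent pointer (for backward navigation)."""

    def __init__(self, label: str, children: Optional[List["Node"]] = None):
        self.label = label
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = children or []
        for child in self.children:
            child.parent = self


def b_children(node: Node) -> int:
    """Number of immediate children named b."""
    return sum(1 for c in node.children if c.label == "b")


def until_a(node: Optional[Node]) -> bool:
    """Walk upwards: succeed on reaching a node named a, fail as soon as a
    node has 2 or fewer b children, and fail if the root is passed."""
    while node is not None:
        if node.label == "a":
            return True
        if b_children(node) <= 2:
            return False
        node = node.parent
    return False


d = Node("d", [Node("b"), Node("b"), Node("b")])
c = Node("c", [Node("b"), d, Node("b"), Node("b")])
root = Node("a", [c])
few = Node("d", [Node("b"), Node("b")])
other_root = Node("a", [few])
print(until_a(d), until_a(few))
```

Failing when the root is passed without meeting a node named a corresponds to the least fixpoint: the upward walk must terminate on a.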

These extensions come at an extra cost, however. It is not difficult to observe
(by induction) that, given a formula $\phi$ with subformulas $\psi_1, \ldots, \psi_n$
counting only on children nodes, if the formulas $\psi_1, \ldots, \psi_n$ are replaced
by their expansions in $\phi$, yielding a formula $\phi'$, then
$|\mathit{lean}(\phi')| \leq |\mathit{lean}(\phi)| * k^{l}$, where $k$ is the greatest
numerical constraint of the counting subformulas, and $l$ is the greatest level
of nesting of counting subformulas. As a consequence of Theorem 4.11, the logic
extended with nested formulas counting on children nodes and formulas counting
on children nodes under the scope of a fixpoint operator can be decided in time

$2^{O(n * k^{l})}$.
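
The shape of this growth can be observed on the translation pattern used for the book/chapter/section example: a constraint "at least n children satisfy p" unfolds into n nested fixpoints, each carrying its own copy of p. The Python sketch below generates such expansions as strings. The function names, the generalization from 2 to an arbitrary n, and the ASCII rendering (`<1>` for first child, `<2>` for next sibling, `mu` for the least fixpoint, `&` and `|` for conjunction and disjunction) are assumptions made for illustration. Counting occurrences of a label is only a rough proxy for the size of the lean.

```python
def among_siblings(prop: str, n: int, depth: int = 0) -> str:
    """Formula holding at a node when at least n nodes among that node and
    its following siblings satisfy `prop` (requires n >= 1)."""
    x = f"x{depth}"
    if n == 1:
        return f"mu {x}. ({prop}) | <2>{x}"
    rest = among_siblings(prop, n - 1, depth + 1)
    return f"mu {x}. (({prop}) & <2>({rest})) | <2>{x}"


def at_least_children(prop: str, n: int) -> str:
    """At least n children satisfy `prop`: go to the first child, then scan
    the sibling sequence."""
    return f"<1>({among_siblings(prop, n)})"


# One level of counting: the label occurs n times in the expansion.
psi = at_least_children("section", 3)

# Two nested levels: every copy of the chapter test carries a copy of psi.
phi = at_least_children(f"chapter & {psi}", 3)

print(psi.count("section"), phi.count("chapter"), phi.count("section"))
```

With constant n at each of l nested levels, the innermost label occurs n to the power l times, which is the same shape as the factor $k^{l}$ in the bound.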

6 Conclusion

We introduced a modal logic of trees equipped with (1) converse modalities,
which make it possible to succinctly express forward and backward navigation,
(2) a least fixpoint operator for recursion, and (3) cardinality constraint
operators for expressing numerical occurrence constraints on tree nodes
satisfying some regular properties. A sound and complete algorithm is presented
for testing satisfiability of logical formulas. This result is surprising since
the corresponding logic for graphs is undecidable [3].

The decision procedure for the logic runs in time exponential in the formula
size. The logic captures regular tree languages with cardinality restrictions, as
well as the navigational fragment of XPath equipped with counting features.
Similarly to backward modalities, numerical constraints do not extend the logical
expressivity beyond regular tree languages. Nevertheless, they enhance the
succinctness of the formalism as they provide useful shorthands for otherwise
exponentially large formulas.

This exponential gain in succinctness makes it possible to extend static analysis
to a larger set of XPath and XML schema features in a more efficient way.
We believe the field of application of this logic may go beyond the XML setting.
For example, in the verification of linked data structures [20] [30] [12],
reasoning on tree structures with in-depth cardinality constraints seems to be a
major issue. Our result may help build solvers that are attractive alternatives
to those based on non-elementary logics such as SkS [28], like Mona [17].